Autonomous Learning for Face Recognition in the Wild via Ambient Wireless Cues

Facial recognition is a key enabling component for emerging Internet of Things (IoT) services such as smart homes or responsive offices. Through the use of deep neural networks, facial recognition has achieved excellent performance. However, this is only possible when the network is trained with hundreds of images of each user in different viewing and lighting conditions. Clearly, this level of effort in enrolment and labelling is impossible for widespread deployment and adoption. Inspired by the fact that most people carry smart wireless devices with them, e.g. smartphones, we propose to use this wireless identifier as a supervisory label. This allows us to curate a dataset of facial images that are unique to a certain domain, e.g. a set of people in a particular office. This custom corpus can then be used to fine-tune existing pre-trained models, e.g. FaceNet. However, due to the vagaries of wireless propagation in buildings, the supervisory labels are noisy and weak. We propose a novel technique, AutoTune, which learns and refines the association between a face and a wireless identifier over time, by increasing the inter-cluster separation and minimizing the intra-cluster distance. Through extensive experiments with multiple users on two sites, we demonstrate the ability of AutoTune to design an environment-specific, continually evolving facial recognition system with no user effort.


INTRODUCTION
Facial recognition and verification are key components of smart spaces, e.g., offices and buildings, for determining who is where. Knowing this information allows a building management system to tailor ambient conditions to particular users, perform automated security (e.g., opening doors for the correct users without the need for a swipe card), and customize smart services (e.g., coffee dispensing). A vast amount of research over the past decades has gone into designing tailored systems for facial recognition, and with the advent of deep learning, progress has accelerated. As an example of a state-of-the-art face recognizer, FaceNet achieves extremely high accuracies (e.g., 99.5%) on very challenging datasets through the use of a low-dimensional embedding, allowing similar faces to be clustered through their Euclidean distance [14,26]. However, when directly transferred to operate 'in the wild', subject to variable lighting conditions, viewing angles and appearance changes, the performance of off-the-shelf pre-trained classifiers degrades significantly, with accuracies around 15% not being uncommon. The solution to this is to obtain a large, labelled corpus of data for a particular environment, with hundreds of annotated images per user. Given access to such a hypothetical dataset, it is then possible to fine-tune the classifier, which was pre-trained on out-of-domain data, so that it adapts to the new environment and achieves excellent performance.
However, the cost of labelling and updating the corpus (e.g. to enrol new users) is prohibitive for most critical applications and therefore, will naturally limit the use and uptake of facial recognition as a ubiquitous technology in emerging Internet of Things (IoT) applications. On the other side, people often, but not always, carry smart devices (phones, fitness devices etc). Wang et al. [36] advocated that although these devices do not provide fine-grained enough positioning capability to act as a proxy for presence, they can be used to indicate that a user might be present in an area with a co-located camera. In this work, we take a step forward and further utilize device presence as weak supervision signals for the purposes of fine-tuning a classifier. The goal now becomes how to take an arbitrary, pre-trained recognition network and tune it from a generic classifier to a highly specific classifier, optimized for a certain environment and group of people. We note that the aim is to make the network better and better at this specific goal, but it would likely perform poorly if transferred directly to a different environment. This is the antithesis of the conventional view of generalized machine learning, but is ideally suited for the problem of environment specific facial recognition, as opposed to generic facial recognition.
The technical challenge is that there is not a 1:1 mapping between a face and a wireless identity, rather we need to solve the association between a set of faces and a set of identities over many sessions or occasions. To further complicate the problem, the sets are not pure i.e. the set of faces can contain additional faces from people not of interest (e.g. visitors). Equally well, due to the vagaries of wireless transmission, the set of wireless identifiers will contain additional identifiers e.g. from people in the next office. Furthermore, it is also possible to have missing observations e.g. because a person was not facing the camera or because someone left their phone at home.
In this work, we present AutoTune, a system which can be used to gradually improve the performance of facial recognition systems in the wild, with zero user effort, tailoring them to the visual specifics of a particular smart space. We demonstrate state-of-the-art performance in real-world facial recognition through a number of experiments and trials.
In particular, our contributions are:
• We observe and demonstrate that the wireless signals of users' devices provide valuable, albeit noisy, clues for face recognition. Namely, wireless signals can serve as weak labels. Such weak labels can replace human-annotated face images in the wild and save intensive labelling effort.
• We create AutoTune, a novel pipeline to simultaneously label face images in the wild and adapt the pre-trained deep neural network to recognize the faces of users in new environments. The key idea is to repeat the face-identity association and network update in tandem. To cope with observation noise, we propose a novel probabilistic framework in AutoTune and design a new stochastic center loss to enhance the robustness of network fine-tuning.
• We deployed AutoTune in two real-world environments. Experimental results demonstrate that AutoTune achieves an F1 score above 0.85 for image labeling in both environments, outperforming the best competing approach by more than 25%. For online face recognition, using the features extracted from the fine-tuned model and training a classifier on the cross-modality labeled images gives a performance gain of about 19% over the best competing approach.
The rest of this paper is organized as follows. §2 introduces the background of this work. The system overview is given in §3. We describe the AutoTune solution in §4 and §5. System implementation details are given in §6. The proposed approach is evaluated and compared with state-of-the-art methods in §7. Finally, we discuss and outline future work in §8 and conclude in §9.

RELATED WORK
Deep face recognition: Face recognition is arguably one of the most active research areas in the past few years, with a vast corpus of face verification and recognition work [23,31,40]. With the advent of deep learning, progress has accelerated significantly. Here we briefly overview state-of-the-art work in Deep Face Recognition (DFR). Taigman et al. pioneered this research area and proposed DeepFace [30]. It uses CNNs supervised by softmax loss, which essentially solves a multi-class classification problem. When introduced, DeepFace achieved the best performance on the Labeled Faces in the Wild (LFW) [9] benchmark. Since then, many DFR systems have been proposed. In a series of papers [27,28], Sun et al. incrementally extended DeepFace and steadily increased the recognition performance. A critical point in DFR happened in 2015, when researchers from Google [26] used a massive dataset of 200 million face identities and 800 million image face pairs to train a CNN called FaceNet, which largely outperformed prior art on the LFW benchmark when introduced. A point of difference is in their use of a "triplet-based" loss [5], which guides the network to learn both inter-class dispersion and intra-class compactness. The recently proposed RTFace [36] not only achieves high recognition accuracy but also operates at the full frame rate of videos. Although the above methods have proven remarkably effective in face recognition, training the supervised DFR network requires a vast amount of labeled images. A large amount of labeled data is not always achievable in a particular domain, and using a small amount of training data will incur poor generalization ability in the wild.
Cross-modality Matching: Cross-modal matching has received considerable attention in different research areas. Methods have been developed to establish mappings from images [7,11,34] and videos [33] to textual descriptions (e.g., captioning), develop image representations from sounds [20,21], recognize speaker identities from Google calendar information [17], and generate visual models from text [41]. In cross-modality matching between images and radio signals, however, work is very limited and all of it is dedicated to trajectory tracking of humans [1,22,32]. Face recognition via wireless signals remains an unexplored area.

AUTOTUNE OVERVIEW

System Model
We consider a face recognition problem with $m$ people of interest (POI), where each subject owns one WiFi-enabled device, e.g., a smartphone. We denote the identity set as $Y = \{y_j \mid j = 1, 2, \ldots, m\}$. The set of observed POI's devices in a particular environment is denoted by $L = \{l_j \mid j = 1, 2, \ldots, m\}$, e.g., a set of MAC addresses. We assume the mapping from device MAC addresses $L$ to the user identities $Y$ is known. A collection of face images $X = \{x_j \mid j = 1, 2, \ldots, n\}$ is cropped from the surveillance videos in the same environment. Note that, to mimic real-world complexity, our collection includes the faces of both POI and some non-POI, e.g., subjects with unknown device MAC addresses. We then assign face and device observations to different subsets based on the events $E = \{e_j \mid j = 1, 2, \ldots, h\}$ they belong to. An event $e_j$ is the setting in which people interact with each other in a specific part of the environment for a given time interval. It is uniquely identified by three attributes: effective timeslot, location, and participants. Fig. 1 shows a few examples of events. Lastly, we also have a deep face representation model $f_\theta$ pre-trained on public datasets that contain no POI. Such a model is trained with metric losses, e.g., triplet loss, so that the learned features have good properties for clustering [26].
In this sense, the problem addressed by AutoTune is assigning IDs to images from noisy observations of images and WiFi MAC addresses, and using such learned ID-image associations to tune the pre-trained deep face representation model automatically.

System Architecture
AutoTune is based on two key observations: i) although collected by different modalities, both face images and device MAC addresses are linked with the identities of users who attend certain events; and ii) the tasks of model tuning and face-identity association should not be dealt with separately, but rather progress in tandem. Based on the above insights, AutoTune works as follows (see Fig. 2):
• Heterogeneous Data Sensing. This module collects facial and WiFi data (device attendance observations) through surveillance cameras and WiFi sniffers (https://www.wireshark.org/) in a target environment. Given the face images and sniffed WiFi MAC addresses, AutoTune first segments them into events based on the time and location at which they were captured.
• Cross-modality Labeling. This module first clusters face images based on their appearance similarity computed by the face representation model, also taking into account information on device attendance in events. Each image cluster should broadly correspond to a user, and the cluster's images are drawn from a set of events. We then assign each cluster to the user whose device has been detected in as similar a set of events as possible.
• Model Updates. Once images are labeled with user identity labels, this module fine-tunes the pre-trained face representation model. AutoTune further uses the cluster labels to update our belief on which device (MAC address) has participated in each event.
The sensing module is one-off and we will detail its implementation in §6.1. The labeling and model update modules are iteratively repeated until the changes in the user attendance model become negligible. The tuned model derived in the last iteration is regarded as the one best adapted to POI recognition in the new environment.

CROSS-MODALITY LABELING
In this section, we introduce the labeling module in AutoTune. The challenge in this module is that collected facial images and sniffed WiFi data are temporally unaligned. For example, detecting a device's WiFi address does not imply that the device owner will be captured by the camera at the exact same instant, and vice versa. Such mismatches in cross-modality data distinguish our problem from prior sensor fusion problems, where multiple sensors observe the same temporally evolving system. In order to tackle the above challenge, we leverage the diverse attendance patterns in events and use a two-step procedure. Images $X$ are first grouped into clusters across all sessions, and we then associate clusters with device IDs (i.e., labels) $L$ based on their similarity in terms of event attendance.

Cross-modality Clustering
Heterogeneous Features. Given a pre-trained face representation model $f_\theta$, an image $x_i$ can be translated to the feature vector $z_i = f_\theta(x_i)$. Unlike conventional clustering that merely depends on the face feature similarity, AutoTune merges face images across events into a cluster (potentially belonging to the same subject) by incorporating attendance information as well. Recall that device attendance already reveals the identities of subjects (in the form of MAC addresses) in a particular event, and the captured images in the same event may contain the faces of these subjects as well. Despite the noise in observations, the overlapping subjects in different events can be employed as a prior that guides the image clustering. For example, if there are no shared MAC addresses sniffed in two events, then it is very likely that the face images captured in these two events should lie in different clusters. Formally, for an event $e_k$, we denote a device attendance vector as
$$u_k = (u_k^1, u_k^2, \ldots, u_k^m),$$
where $u_k^j = 1$ if the device $l_j$ is detected in event $e_k$ and $u_k^j = 0$ otherwise. In this way, we can construct a heterogeneous feature $\tilde{z}_i = [z_i, u_k]$ for an image $x_i$ collected in the event $e_k$. Note that, as all images captured in the same event have the same attendance vector, the device attendance part only takes effect in cross-event image comparison.
Figure caption (event vector example): The $k$-th element of the event vector $r_{c_i}$ of a cluster $c_i$ is set to 1 only if the cluster contains images from the event $e_k$. The event vector $r_{l_j}$ of the device ID $l_j$ is built similarly. This insight leads to Eq. 1.
Face Image Similarity. Given an image $x_i$ captured in event $e_k$ (i.e., $\tilde{z}_i = [z_i, u_k]$) and an image $x_j$ captured in event $e_p$ (i.e., $\tilde{z}_j = [z_j, u_p]$), the likelihood that two cross-event face images belong to the same subject is conditioned on two factors: i) the similarity of their feature representations $z_i$ and $z_j$; and ii) the overlap ratio between the attendance observations $u_k$ and $u_p$ of their corresponding events. The resulting joint similarity is a log likelihood $\log(\Pr(x_i = x_j))$ defined as follows:
$$\log(\Pr(x_i = x_j)) \propto \beta \log \frac{|u_k \otimes u_p|}{|u_k \oplus u_p|} - D(z_i, z_j)$$
Here $\otimes$ and $\oplus$ are element-wise AND and OR, and $|\cdot|$ is the $L_1$-norm. $z$ denotes the features transformed by the face representation model and $D$ is a distance measure between face features. $\beta$, analogous to the regularization parameter in composite loss functions, is a hyper-parameter that controls the contributions of the attendance assistance and the feature similarity. The above derivation is inspired by the Jaccard coefficient, with the difference lying in the log function. The rationale behind the term $|u_k \oplus u_p|$ is that the more different subjects attend the two events, the more uncertain it is that any two images drawn across these events point to the same subject. In contrast, when the intersection $|u_k \otimes u_p|$ is large, the chance that these two images point to the same subject becomes higher. This joint similarity can also be explained from a Bayesian perspective. The attendance similarity of two events can serve as a prior that two cross-event images belong to the same subject, and the feature similarity can be seen as the likelihood. Together they determine the posterior probability that two cross-event images fall into the same cluster. Based on the above joint similarity, images across events are grouped into clusters $C = \{c_i \mid i = 1, 2, \ldots, g\}$. We will discuss how to determine the number of clusters $g$ based on the complete set of MAC addresses $L$ in the next section.
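To make the two factors concrete, the following is a minimal sketch of such a joint similarity. It assumes the score is β times the log of the Jaccard-style attendance overlap minus the Euclidean distance between embeddings. The function name, the default `beta`, and the small `eps` smoothing term (which avoids taking the log of zero when two events share no device) are illustrative assumptions and are not taken from the paper.

```python
import math

def joint_log_similarity(z_i, z_j, u_k, u_p, beta=1.0, eps=1e-6):
    """Unnormalised log-similarity of two face images from events k and p.

    z_i, z_j: face embeddings (lists of floats).
    u_k, u_p: binary device attendance vectors of the two events.
    beta, eps: hypothetical hyper-parameter and smoothing constant.
    """
    # |u_k AND u_p|: number of devices seen in both events.
    inter = sum(a and b for a, b in zip(u_k, u_p))
    # |u_k OR u_p|: number of devices seen in at least one event.
    union = sum(a or b for a, b in zip(u_k, u_p))
    # Attendance prior in log space (Jaccard-style ratio, smoothed by eps).
    log_prior = math.log((inter + eps) / (union + eps))
    # Feature term: a smaller embedding distance means a higher similarity.
    return beta * log_prior - math.dist(z_i, z_j)

# Two similar embeddings, compared under different attendance overlaps.
z_a, z_b = [0.10, 0.20], [0.10, 0.25]
u_1, u_2, u_3 = [1, 1, 0], [1, 0, 0], [0, 0, 1]
s_shared = joint_log_similarity(z_a, z_b, u_1, u_2)    # one shared device
s_disjoint = joint_log_similarity(z_a, z_b, u_1, u_3)  # no shared device
```

With identical feature distances, the pair drawn from events that share a device scores far higher than the pair drawn from events with disjoint attendance, which is the behaviour the attendance prior is meant to produce.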

Cluster Labeling
Fine-grained ID Association. After clustering, an image cluster is linked with multiple events that are associated with its member images. Naturally, we introduce an event vector $r_{c_i} = (r_{c_i}^1, r_{c_i}^2, \ldots, r_{c_i}^h)$ for an image cluster $c_i$, where $h$ is the total number of events. $r_{c_i}^k$ is set to 1 only if $c_i$ contains images from event $e_k$. Fig. 3 shows an example of how an event vector is developed. Similarly, for a device ID $l_j$, its corresponding event vector $r_{l_j} = (r_{l_j}^1, r_{l_j}^2, \ldots, r_{l_j}^h)$ can be determined by inspecting its occurrences in all WiFi sniffing observations. $r_{l_j}^k$ is set to 1 only if the device ID (MAC address) $l_j$ is detected in the event $e_k$. The intuition behind ID association is that a device and a face image cluster of the same subject should share the most consistent attendance pattern in events, reflected by the similarity of their event vectors. Based on this intuition, AutoTune assigns device IDs to clusters based on the matching level of their event vectors. Formally, a bipartite graph matching problem can be formulated as follows:
$$\min_{a} \sum_{i=1}^{g} \sum_{j=1}^{m} a_{ij} \, d(r_{c_i}, r_{l_j}) \quad \text{s.t.} \quad \sum_{i=1}^{g} a_{ij} = 1 \;\; \forall j, \quad \sum_{j=1}^{m} a_{ij} \le 1 \;\; \forall i, \quad a_{ij} \in \{0, 1\} \tag{1}$$
where $d(\cdot, \cdot)$ measures the mismatch between two event vectors and the solution of the binary variable $a_{ij}$ assigns a device ID to a cluster. We note that when $m < g$, AutoTune adds dummy nodes to create the complete bipartite graph. The complete bipartite graph problem can then be solved by the Hungarian algorithm [10].
Probabilistic Labeling via Soft Voting. We now obtain the association between face images and device IDs. However, in practice, the number of clusters $g$ can vary due to captured face images of people outside the POI, e.g., short-term visitors. The choice of $g$ has a significant impact on the performance of clustering [6], which could further affect the subsequent fine-tuning performance. To cope with this issue, AutoTune sets the number of clusters $g$ greater than the number of POI $m$. With a larger number of clusters, although there are some unused clusters of non-POI after clustering, the association step will sieve them out and only choose the $m$ clusters most consistent with the event logs of the $m$ device IDs. Lastly, for every image, its assigned ID is finalized by voting over the individual association results computed with different $g$, as described next. Typically, a majority voting procedure would give a hard ID label for an image and use these <Image, ID> pairs as training data to update the face representation model. However, due to the noise in device presence observations and errors in clustering, there will be mislabeled training data that may confuse the model update. To account for the uncertainty of ID assignment, we adopt soft labels for voting. Instead of voting for the most likely ID $y_i$ for an image $x_i$, we introduce a probability vector $y_i = (y_{i,1}, y_{i,2}, \ldots, y_{i,m})$. Specifically, every image is associated with all POI, and its soft label is derived as the votes for each subject divided by the total number of votes. In this way, the soft label of each image is a valid probability distribution that sums to 1. For instance, $y_{i,j} = 0.4$ means that 40% of the associations assign the ID $j$ to the image $i$. Moreover, the soft-labeled data $\langle x_i, y_i \rangle$ is compatible with the computation of cross-entropy.
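The soft voting step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each association run (one per value of the cluster count) is assumed to give every image either a POI index or `None` when the image fell into an unassigned cluster, and images that never receive a vote are simply dropped. The function name and the data layout are hypothetical.

```python
from collections import Counter

def soft_labels_from_votes(votes, num_poi):
    """Turn per-run ID assignments into soft labels.

    votes: dict mapping image id -> list with one entry per association
           run; each entry is a POI index in [0, num_poi) or None.
    Returns a dict mapping image id -> probability vector of length num_poi.
    """
    labels = {}
    for image_id, assigned in votes.items():
        counts = Counter(j for j in assigned if j is not None)
        total = sum(counts.values())
        if total == 0:
            # Assumption: images never matched to a POI carry no label.
            continue
        # Votes for each subject divided by the total number of votes.
        labels[image_id] = [counts[j] / total for j in range(num_poi)]
    return labels

# Five hypothetical association runs for three images and three POI.
votes = {
    "img_a": [0, 0, 1, 0, None],               # mostly subject 0
    "img_b": [None, None, None, None, None],   # likely a visitor
    "img_c": [2, 2, 2, 2, 2],                  # consistently subject 2
}
labels = soft_labels_from_votes(votes, num_poi=3)
```

Here `img_a` receives the soft label (0.75, 0.25, 0), so one dissenting run lowers its weight in training instead of flipping its label.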

Number of Clusters.
We are now in a position to describe how to select the number of clusters $g$. We observed that, with a proper clustering algorithm (agglomerative clustering in our case), the images of non-POI form separate small clusters, but these will not be chosen in the follow-up association step. This is because the number of device IDs $m$ is always less than the number of clusters ($g \in [2m, 5m]$), and only those image clusters with the most consistent event attendance are associated. The small clusters of outlier images often have scattered attendance vectors and will be ignored in association. Therefore, soft voting with a varying number of clusters reinforces the core images of POI by giving them large probabilities, and assigns small probabilities to the images of non-POI.
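The way the association step ignores surplus clusters can be illustrated with a small sketch. It makes two assumptions that are not from the paper: the mismatch between event vectors is measured by the Hamming distance, and the optimal assignment is found by brute-force enumeration, which is only a stand-in for the Hungarian algorithm [10] and is feasible only for tiny inputs. For realistic sizes an assignment solver such as SciPy's `linear_sum_assignment` would be the natural choice.

```python
from itertools import permutations

def hamming(r_a, r_b):
    """Number of events in which two binary event vectors disagree."""
    return sum(a != b for a, b in zip(r_a, r_b))

def associate(cluster_vectors, device_vectors):
    """Assign each device to a distinct cluster with minimal total mismatch.

    Brute force over all ordered selections of m clusters out of g;
    a stand-in for the Hungarian algorithm on small examples.
    Returns ({cluster index: device index}, total cost).
    """
    g, m = len(cluster_vectors), len(device_vectors)
    best_cost, best_perm = None, None
    for perm in permutations(range(g), m):
        cost = sum(hamming(cluster_vectors[c], device_vectors[j])
                   for j, c in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return {c: j for j, c in enumerate(best_perm)}, best_cost

# Two devices observed over five events (hypothetical data).
devices = [[1, 1, 0, 1, 0], [0, 1, 1, 0, 1]]
# Four clusters: two POI clusters plus two small outlier clusters.
clusters = [
    [1, 1, 0, 1, 0],  # matches device 0 exactly
    [0, 0, 1, 0, 0],  # visitor seen once
    [0, 1, 1, 0, 0],  # device 1's owner, missed by the camera in one event
    [1, 0, 0, 0, 0],  # outlier images
]
assignment, cost = associate(clusters, devices)
```

Clusters 1 and 3 stay unassigned because their attendance patterns are less consistent with any device, so their images receive no vote in this run.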

MODEL UPDATES
We are now in a position to introduce the model update in AutoTune.
At every iteration $\tau$, AutoTune updates the face representation model of the last iteration, $f_\theta^{\tau}$, to $f_\theta^{\tau+1}$ by taking the labeled images as inputs. To correct the device observation errors, AutoTune also leverages these labeled images to update the device attendance vectors $u^{\tau}$ of all events.

Visual Model Update
Discriminative Face Representation Learning. Face representation learning optimizes a representation loss $L_R$ to make the learnt features as discriminative as possible. Strong discrimination has two properties: inter-class dispersion and intra-class compactness. Inter-class dispersion pushes face images of different subjects away from one another, and intra-class compactness pulls the face images of the same subject together. Both properties are critical to face recognition. At iteration $\tau$, given the current label $y_i^\tau$ for the $i$-th face image $x_i$ and the transformed features $z_i^\tau = f_\theta^\tau(x_i)$, the representation loss $L_R$ is a composition of softmax loss and center loss:
$$L_R = L_{softmax} + \lambda L_{center} = -\sum_{i=1}^{n} \log \frac{\exp(W_{y_i^\tau}^{\top} z_i^\tau + b_{y_i^\tau})}{\sum_{j=1}^{m} \exp(W_j^{\top} z_i^\tau + b_j)} + \frac{\lambda}{2} \sum_{i=1}^{n} \lVert z_i^\tau - o_{y_i^\tau} \rVert_2^2 \tag{2}$$
where $W$ and $b$ are the weight and bias parameters of the last fully connected layer of the pre-trained model. $o_{y_i^\tau}$ denotes the centroid feature vector obtained by averaging all feature vectors with the same identity label $y_i^\tau$. The center loss $L_{center}$ explicitly enhances the intra-class compactness, while the inter-class dispersion is implicitly strengthened by the softmax loss $L_{softmax}$ [37]. $\lambda$ is a hyper-parameter that balances the two sub-losses.

Stochastic Center Loss
The center loss $L_{center}$ in Eq. 2 has been shown to enhance the intra-class compactness [37]. However, we cannot directly adopt it for fine-tuning, as computing the centers requires explicit image labels (see Eq. 2), while the association steps above only provide probabilistic ones through soft labels. To solve this, we propose a new loss called stochastic center loss $L_{stoc}$ to replace the center loss. Similar to the idea of fuzzy sets [38], we allow each face image to belong to more than one subject. The membership grades indicate the degree to which an image belongs to each subject and can be directly retrieved from the soft labels. The stochastic center $o_k^\tau$ for the $k$-th identity is given as:
$$o_k^\tau = \frac{\sum_{i=1}^{n} y_{i,k}^\tau \, z_i^\tau}{\sum_{i=1}^{n} y_{i,k}^\tau}$$
This gives the stochastic center loss as follows:
$$L_{stoc} = \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{m} y_{i,k}^\tau \, \lVert z_i^\tau - o_k^\tau \rVert_2^2$$
We leave the softmax loss $L_{softmax}$ the same as in Eq. 2, because the soft labels are compatible with the computation of cross-entropy. The new representation loss to minimize is:
$$L_R = L_{softmax} + \lambda L_{stoc}$$
AutoTune updates the model parameters $\theta^{\tau}$ to $\theta^{\tau+1}$ based on the gradients $\nabla_\theta L_R$, which are calculated via backpropagation of errors. Compared with the dataset used for pre-training, which is usually in the order of millions of images [4], the data used for fine-tuning is much smaller (several thousand images). The mismatch between the small training data and the complex model architecture could result in overfitting. To prevent this, we use the dropout mechanism in training, which is widely adopted to avoid overfitting [13].
Meanwhile, as observed in [8,25], the soft label itself can play the role of a regularizer, and make the trained model robust to noise and reduce overfitting. Fig. 4 illustrates the effect of model update.
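As an illustration of how soft labels enter the center computation, the sketch below assumes membership-weighted mean centers and a membership-weighted squared Euclidean distance, in the spirit of fuzzy clustering. This exact form is an assumption of the example, the function names are hypothetical, and a real implementation would operate on tensors inside the training framework so that gradients can flow.

```python
def stochastic_centers(features, soft_labels):
    """Membership-weighted mean feature for every identity k."""
    num_ids, dim = len(soft_labels[0]), len(features[0])
    centers = []
    for k in range(num_ids):
        weight = sum(y[k] for y in soft_labels)
        if weight == 0:
            # No image supports identity k: use a zero placeholder centre.
            centers.append([0.0] * dim)
            continue
        centers.append([
            sum(y[k] * z[t] for z, y in zip(features, soft_labels)) / weight
            for t in range(dim)
        ])
    return centers

def stochastic_center_loss(features, soft_labels, centers):
    """Half the membership-weighted squared distance to every centre."""
    loss = 0.0
    for z, y in zip(features, soft_labels):
        for k, o in enumerate(centers):
            loss += 0.5 * y[k] * sum((a - b) ** 2 for a, b in zip(z, o))
    return loss

# Three 2-D features; the middle image is split evenly between two IDs.
features = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
soft = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
centers = stochastic_centers(features, soft)
loss_soft = stochastic_center_loss(features, soft, centers)

# With one-hot labels the centre is the plain class mean, as in Eq. 2.
hard_features = [[0.0, 0.0], [2.0, 0.0]]
hard = [[1.0, 0.0], [1.0, 0.0]]
hard_centers = stochastic_centers(hard_features, hard)
loss_hard = stochastic_center_loss(hard_features, hard, hard_centers)
```

The one-hot case shows that this weighting reduces to the ordinary center loss when labels are certain, while an ambiguous image pulls on several centers in proportion to its membership grades.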

User Attendance Update
The device presence observations by WiFi sniffing are noisy because the WiFi signal of a device is opportunistic and people do not carry or use their devices all the time. Based on the results of the cluster labelling step introduced in §4.2, we have the opportunity to update our belief on which users attended each event. The update mechanism is as follows: Each image is associated with a user probability vector, whose elements denote the probability that the image corresponds to a particular user. By averaging the user probability vectors of all images that have been drawn from the same event $e_k$, and normalizing the result, we can estimate the user attendance of this event. The elements of the resulting user attendance vector $\hat{u}_k^\tau$ denote the probabilities of different users attending event $e_k$.
We can now use $\hat{u}_k^\tau$ as a correction to update our previous WiFi attendance vector $u_k^\tau$ as follows:
$$u_k^{\tau+1} = (1 - \gamma) \, u_k^\tau + \gamma \, \hat{u}_k^\tau$$
where $\gamma$ is a pre-defined parameter that controls the ID update rate. In principle, a large update rate will speed up convergence, at the risk of missing the optimum. AutoTune sequentially repeats the above steps of clustering, labelling and model updates until the changes in the user attendance model fall below the threshold $\xi$ (0.01 in our case). Algorithm 1 summarizes the workflow.
ALGORITHM 1: AutoTune
Input: pre-trained model $f_\theta^1$, images $X$, POI's device IDs $L$, threshold $\xi$
Output: adapted model $f_\theta^*$, corrected attendance observations $I^*$, soft image labels $Y^*$
Initialize: given sniffed $L$ in all events $E$, compute attendance vector $u^0$; $\tau = 1$
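The attendance correction can be sketched as below. Several details are assumptions of this example rather than statements from the paper: the update is written as a convex combination of the previous vector and the image-based estimate, the estimate is normalized by its largest element, and the estimate is held fixed across iterations so that the stopping rule is easy to follow. In AutoTune the estimate would change every round as labels and model are refreshed.

```python
def estimate_attendance(event_soft_labels):
    """Average the soft labels of one event's images, scaled so max is 1."""
    n, m = len(event_soft_labels), len(event_soft_labels[0])
    mean = [sum(y[j] for y in event_soft_labels) / n for j in range(m)]
    peak = max(mean)
    # Assumption: normalise by the best-supported user.
    return [v / peak for v in mean] if peak > 0 else mean

def update_attendance(u_prev, u_hat, gamma):
    """Move the previous attendance vector towards the estimate."""
    return [(1 - gamma) * a + gamma * b for a, b in zip(u_prev, u_hat)]

# WiFi says users 0 and 2 attended; the images mostly show users 0 and 1
# (user 1 left the phone at home, user 2 was sniffed from next door).
u_wifi = [1.0, 0.0, 1.0]
event_labels = [[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]]
u_hat = estimate_attendance(event_labels)
u_next = update_attendance(u_wifi, u_hat, gamma=0.5)

# Repeat until the largest change drops to the threshold of 0.01.
u, steps = u_wifi, 0
while steps < 100:
    u_new = update_attendance(u, u_hat, gamma=0.5)
    change = max(abs(a - b) for a, b in zip(u, u_new))
    u, steps = u_new, steps + 1
    if change <= 0.01:
        break
```

After the first update, the belief that user 2 attended has halved and user 1 has gained support, which is the correction of sniffing errors described above.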

IMPLEMENTATION
In this section, we introduce the implementation details of AutoTune (code available at https://github.com/Wayfear/Autotune).

Heterogeneous Data Sensing
Face Extraction. This module consists of a front-end remote camera and a back-end computation server. Specifically, the remote cameras in our experiment are diverse and include GoPro Hero 4, Mi Smart Camera and Raspberry Pi Camera. We modified these cameras so that they are able to communicate and transfer data to the back-end through a wireless network. To avoid capturing excess data without people in it, we use a motion-triggered mechanism with a circular buffer. It works by continuously taking low-resolution images and comparing them to one another for changes caused by something moving in the camera's field of view. When a change is detected, the camera takes a higher-resolution video for 5 seconds and then reverts to low-resolution capturing. All the collected videos are sent to the back-end at midnight every day. On the back-end, a face detection module based on a cascaded convolutional network [39] is used to drop videos with no face in them. The cropped faces from the remaining videos are supplied to AutoTune.
WiFi Sniffing. This module is realized on a WiFi-enabled laptop running Ubuntu 14.04. Our sniffer uses Aircrack-ng and tshark to opportunistically capture the WiFi packets in the vicinity. A captured packet carries unencrypted information such as transmission time, source MAC address and the Received Signal Strength (RSS). As AutoTune aims to label face images for POI, our WiFi sniffer only records the packets containing MAC addresses of POI and discards all others, so as not to harvest addresses from people who have not given consent. A channel hop mechanism is used in the sniffing module to cope with cases where the POI's device(s) may connect to different WiFi networks, namely, on different wireless channels. The channel hop mechanism forces the sniffing channel to change every second, so that the active channels in the environment are monitored periodically. The RSS value in the packet indicates how far away the sniffed device is from the sniffer [18,19]. By putting the sniffer near the camera, we can use a threshold to filter out devices with low RSS values, e.g., less than -55 dBm in this work, as they are empirically unlikely to be within the camera's field of view.
Event Segmentation. Depending on the context, the duration of events can be variable. However, for simplicity we use fixed-duration events in this work. Specifically, we split a day into 12 intervals, each of which is 2 hours long. We then discard those events that have neither face images nor sniffed MAC addresses of POI.
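The RSS filtering and fixed-duration event segmentation described above can be sketched as follows. The record layouts, function names and MAC addresses are hypothetical; only the 2-hour slot length, the -55 dBm threshold and the POI-only filter come from the text.

```python
from collections import defaultdict
from datetime import datetime

RSS_THRESHOLD_DBM = -55   # weaker signals are treated as out of view
SLOT_HOURS = 2            # 12 fixed slots per day

def event_key(timestamp, location):
    """An event is identified by day, 2-hour slot and location."""
    return (timestamp.date(), timestamp.hour // SLOT_HOURS, location)

def segment_events(face_obs, wifi_obs, poi_macs):
    """Group observations into events.

    face_obs: iterable of (timestamp, location, image_id).
    wifi_obs: iterable of (timestamp, location, mac, rss_dbm).
    Events are created only when an observation is kept, so events with
    neither faces nor POI devices never appear in the result.
    """
    events = defaultdict(lambda: {"images": [], "macs": set()})
    for ts, loc, image_id in face_obs:
        events[event_key(ts, loc)]["images"].append(image_id)
    for ts, loc, mac, rss in wifi_obs:
        if mac in poi_macs and rss >= RSS_THRESHOLD_DBM:
            events[event_key(ts, loc)]["macs"].add(mac)
    return dict(events)

poi_macs = {"aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"}
face_obs = [(datetime(2019, 1, 7, 9, 15), "office", "img_001")]
wifi_obs = [
    (datetime(2019, 1, 7, 9, 40), "office", "aa:aa:aa:aa:aa:01", -48),
    (datetime(2019, 1, 7, 9, 50), "office", "aa:aa:aa:aa:aa:02", -70),  # too weak
    (datetime(2019, 1, 7, 9, 55), "office", "bb:bb:bb:bb:bb:09", -40),  # not a POI
    (datetime(2019, 1, 7, 13, 5), "kitchen", "aa:aa:aa:aa:aa:01", -50),
]
events = segment_events(face_obs, wifi_obs, poi_macs)
office_key = (datetime(2019, 1, 7).date(), 4, "office")
kitchen_key = (datetime(2019, 1, 7).date(), 6, "kitchen")
```

The kitchen event contains a device but no face image, which is exactly the kind of missing observation the later association and update steps have to tolerate.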

Face Recognition Model
The face recognition model used in AutoTune is the state-of-the-art FaceNet [26].
Pre-training. FaceNet adopts the Inception-ResNet-v1 [29] as its backbone and its weights are pre-trained on the VGGFace2 dataset [4]. This dataset contains 3.31 million images of 9131 subjects, with an average of 362.6 images for each subject. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession. Pre-training is supervised by the triplet loss [5], and the training protocols, e.g., parameter settings, can be found in [26]. We found that the face representation learnt by FaceNet is generalizable and able to achieve an accuracy of 99.65% on the LFW face verification task. Note that FaceNet [26] not only learns a powerful recognition model but also gives a very discriminative feature representation that can be used for clustering.
Fine-tuning by AutoTune. The fine-tuning process has been explained in §5.1, and here we provide some key implementation details. After each round of label association (see §4.2), the labeled data is split into a training set and a validation set, with a ratio of 8:2. The pre-trained FaceNet is then fine-tuned on the training set, and the model that achieves the best performance on the validation set is saved. Note that the fine-tuning process in AutoTune does not involve the test set. The online testing is performed on a held-out set that is collected on different days. To enhance the generalization ability, we use dropout training for regularization [35]. The dropout ratio is set to 0.2. We set the batch size to 50, and one round of fine-tuning takes 100 epochs.

System Configuration
Face Detection. As discussed in §6.1, we use a cascaded convolutional network to detect faces in videos. It is a cascade of three sub-networks: a proposal network, a refine network and an output network. Each of them outputs bounding boxes of potential faces and the corresponding detection probabilities, i.e., confidences. Face detections with low confidence are discarded early in the process and not sent to the next sub-network. In this work, the confidence thresholds are set to 0.7, 0.7 and 0.9 for the three sub-networks respectively. Following the original setting in [39], we set the minimal face size in detection to 40 × 40 pixels.
Setup of Face Clustering. In §4.1, we use a clustering algorithm to merge the images across events. Specifically, the clustering algorithm used in AutoTune is agglomerative clustering [2], a method that recursively merges the pair of clusters that minimally increases a given linkage distance. The distance metric adopted here is the Euclidean distance. A linkage criterion determines which distance to use between sets of data points. In AutoTune, this linkage criterion is set to the average distance over all sample pairs from the two sets. As introduced in §4.2, the number of clusters $g$ is determined by the number of POI $m$. We vary $g$ from $2m$ to $5m$ and perform soft voting, to account for the extra clusters of non-POI face images and ambiguous or outlier images of POI.
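The clustering setup above can be illustrated with a naive average-linkage agglomerative procedure. This sketch uses plain Euclidean distance on toy 2-D points and recomputes all pairwise linkages at every merge, so it is only suitable for small examples. The function names are hypothetical, and a library implementation would normally be used in practice.

```python
import math

def average_linkage(points, cluster_a, cluster_b):
    """Mean Euclidean distance over all cross-cluster sample pairs."""
    total = sum(math.dist(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerative(points, num_clusters):
    """Merge the closest pair of clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(points, clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]   # merge y into x
        del clusters[y]
    return clusters

# Two tight groups plus one isolated point (e.g. a visitor's face).
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
result = agglomerative(points, num_clusters=3)
```

Asking for more clusters than there are subjects leaves the isolated point in its own small cluster instead of forcing it into a subject's cluster, which is the reason for choosing $g$ larger than $m$.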

EVALUATION
In this section, we evaluate AutoTune extensively on datasets collected from both real-world experiments and simulation. We deployed two testbeds, one in the UK and the other in China, and collected datasets as described in the previous section. The simulation dataset is developed based on a public face dataset.

Evaluation Protocols
Competing Approaches. We compare the performance of AutoTune with three competing approaches:
• Template Matching (TM) [3] employs a template matching method to assign ID labels to clusters of face images. This is the most straightforward method that is used when one or more profile photos of POI are available, e.g., crawled from their personal homepage or Facebook.
• One-off Association (OA) [16] uses one-off associations to directly label the image clusters without fine-tuning of the face representation model itself.
• Deterministic AutoTune (D-AutoTune) is the deterministic version of AutoTune. In D-AutoTune, the association and update steps are the same as in AutoTune, but it adopts hard labels rather than soft labels and uses the simple center loss instead of the proposed stochastic center loss (see §5.1).
Evaluation Metrics. AutoTune contains two main components, offline label assignment and online inference. For the offline label assignment, we evaluate its performance with metrics computed from the counts TP, TN, FP and FN, which denote the numbers of true positives, true negatives, false positives and false negatives respectively. Each metric captures a different aspect of classification [24]. Online face recognition has two kinds of tests, face identification and face verification. Following [14,15], we use the Cumulative Match Characteristic (CMC) to evaluate online face identification.
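For reference, the offline labeling results below are reported as F1 score and accuracy. In terms of the counts TP, TN, FP and FN, the standard textbook definitions of these metrics, together with the precision and recall on which F1 depends, are given here; they are stated as common definitions, not quoted from the original equations.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, &
\text{Recall} &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, &
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}.
\end{align}
```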

Offline Cross-modality Face Labeling
AutoTune automatically labels images captured in the wild by exploiting their correlation with device presence. The quality of image labeling is crucial for the subsequent face recognition in the online stage. In this section, we investigate the image labeling performance of AutoTune.

Data Collection.
We deployed AutoTune at testbeds in two countries, each with different challenges.
UK Site: The first dataset is collected in a commercial building in the UK. We deploy heterogeneous sensing front-ends, including surveillance cameras and WiFi sniffers, on a floor with three types of rooms: office, meeting room and kitchen. The 24 long-term occupants who work there can move freely between these rooms, and they are naturally chosen as the people of interest (POI). For the office, face images are captured with a surveillance camera that faces the entrance. The presence logs of the occupants' WiFi MAC addresses are collected over the same period by a sniffer situated in the center of the room. Besides the POI's faces, the images also contain the faces of 11 short-term visitors who came to this floor during the experiments. We put different cameras in different rooms to examine the performance of AutoTune under camera heterogeneity. To further examine the resilience of AutoTune, we put cameras in adversarial positions. In the kitchen, we deploy cameras with poor views near the entrance, so that they can only capture subjects above 1.7m. In the meeting room, which has two entrances, only the primary entrance is equipped with cameras. Therefore, in both rooms the cameras constantly miss face images of subjects. Tab. 1 summarizes this data collection.
CHN Site: We collect another dataset in a common room of a university in China. There are no long-term occupants at this site and all undergraduates can enter. Of the 37 people who appeared during the three-week period, 12 subjects are selected as the POI, and the presence of their WiFi MAC addresses is continuously recorded by the sniffer. Other settings remain the same as at the UK site. The challenge in this dataset is that the captured face images, for both POI and non-POI, are all of Asian people, while the initial face representation model is trained primarily on Caucasians. Details of the CHN dataset are given in Tab. 1. Observation noise is very common in this dataset.

Overall Labeling Performance.
We start our evaluation with one room only and compare our results with the baselines. Fig. 5 shows the performance of label assignment, i.e., matching an identifier to a face image. For the office dataset, AutoTune outperforms the best competing approach (OA) by 0.13 in F1 score and 7% in accuracy. The advantage of AutoTune is more obvious in the CommonRoom experiment, where it beats the best competing approach (OA) by 0.34 in F1 score and 10% in accuracy. As the only method that uses website images (one-shot learning) rather than device ID information to label images, TM struggles in both experiments and is 9-fold worse than AutoTune. We observe that the website face images are dramatically different from the images captured in the real world, due to different shooting conditions and image quality. These results imply that, although the device observations obtained via WiFi sniffing are noisy, when there are enough of them they are more informative than web-crawled face images. Additionally, we note that adopting soft labels (see §4.2) further improves the labeling performance. In terms of F1 score, the full-suite AutoTune is around 15% and 22% better than D-AutoTune in the two experiments respectively. Similar improvements are observed in terms of accuracy. The reason for the larger performance gap on the CommonRoom dataset is that more images of non-POI are captured in this experiment. In addition, the pre-trained FaceNet model does not generalize very well to Asian faces, which are the primary ethnicity in the CommonRoom.
Lastly, we found that the choice of algorithm for cross-event clustering affects the final label-association accuracy. The best performance is achieved with the default agglomerative clustering algorithm on both datasets. The second-best algorithm, spectral clustering (SC), is slightly inferior to agglomerative clustering on the office dataset, though the gap between them grows on the CommonRoom dataset. The Gaussian Mixture Model (GMM) is inferior to the other two clustering algorithms. This is because GMM is best suited to Mahalanobis-distance based clustering, whereas the distance space defined in Eq. 4.1 is non-flat due to the introduced attendance information. Hence, its resultant clusters are very impure and give poor association performance.

Performance vs. Scalability.
We further examine AutoTune with multiple rooms under different adversarial conditions. Together with the UK office data, we evaluate AutoTune when events are collected in three different locations and via heterogeneous cameras. In particular, a kitchen and a meeting room are included, and each of them has two lower-fidelity cameras (Mi Home Camera and Raspberry Pi Camera). The setup of the multi-location experiments is given in Tab. 1. As described in §7.2.1, there are many missed face detections in the kitchen and meeting room due to the adversarial camera setups. Nevertheless, as shown in Fig. 6, AutoTune suffers only a small performance drop (≤ 0.03) when erroneous camera observations are mixed with the single-office data. In terms of F1 score, which is the most important metric, AutoTune achieves a level comparable to the office experiment, regardless of which camera is added. Moreover, AutoTune maintains its good performance even in the most adversarial case, where both errors are mixed in (the three-room experiment). The main reason is that more data, though noisy, also provides extra validation constraints, which AutoTune exploits to become robust to observation errors.

Performance vs. Lifespan.
AutoTune exploits co-located device and face presence in different events to establish the cross-modality correlations. In this section, we investigate the impact of the collection span on the performance of labeling. Longer collection periods give more events. We investigate this impact by feeding AutoTune with data collected over different numbers of days, and compare the results with those obtained using all days on the two datasets respectively. Fig. 7 shows that AutoTune performs better with an increasing number of days on both datasets. The gap in F1 score between the case with all days (10 days) and the case with the fewest days (2 days) can be larger than 0.32 on both datasets. As discussed in §4.2, the ID association needs sufficiently diverse events to create sufficiently discriminative event vectors. Otherwise, there will be faces or devices with the same event vectors, which hinders AutoTune's ability to disambiguate their mutual association. However, we also observe that when data is collected for more than 8 days, the performance improvement of AutoTune becomes marginal.
Figure 8: Impact of ID update rate β (introduced in §5.2) on (a) the UK dataset and (b) the CHN dataset. It can be seen that a high update rate causes the network to converge rapidly to potentially incorrect assignments.
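The role of event diversity in the lifespan experiment can be made concrete with a toy sketch. Assuming, purely for illustration, that an event vector is a binary presence indicator per event, the code below counts identities whose vectors collide when only the first few events are available; the function names and the attendance data are hypothetical and do not reproduce the exact definition used in §4.2.

```python
from collections import defaultdict


def event_vectors(attendance, n_events):
    """Binary presence vector per identity over the first n_events events."""
    ids = sorted({i for event in attendance for i in event})
    return {
        i: tuple(int(i in event) for event in attendance[:n_events])
        for i in ids
    }


def ambiguous_groups(vectors):
    """Groups of identities that share exactly the same event vector."""
    groups = defaultdict(list)
    for ident, vec in vectors.items():
        groups[vec].append(ident)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical attendance: each set lists who was present in one event.
attendance = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B"}]

one_event = ambiguous_groups(event_vectors(attendance, 1))
two_events = ambiguous_groups(event_vectors(attendance, 2))
three_events = ambiguous_groups(event_vectors(attendance, 3))
```

With one event all three identities are indistinguishable, with two events A and B still collide, and with three events every vector is unique, which mirrors why more collection days help until the vectors are already distinct.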
7.2.5 Impact of ID update rate β.
This section investigates the impact of the ID update rate β introduced in §5.2. The ID update uses the fine-tuned face recognition model to update the device ID observations. A large ID update rate forces the device ID observations to quickly become consistent with the deep face model predictions. However, a large update rate also runs the risk of missing the optimum. We vary the update rate β from 0.05 to 0.20 in steps of 0.05. Fig. 8a demonstrates that AutoTune achieves the best performance on the UK dataset when the update rate is set to 0.05. The performance declines by 10% when the rate rises to 0.2. This is because the updated ID observations quickly become the same as the model predictions while the model predictions are not yet sufficiently correct. On the CHN dataset (see Fig. 8b), a similar trend in F1 score can be seen. Although convergence generally becomes faster as the update rate increases, we observe a fluctuation at the update rate of 0.15, where AutoTune takes 7 iterations to converge. By inspecting the optimization process, we found that, under this parameter setting, AutoTune oscillated because the large update step makes it jump around in the vicinity of the optimum without being able to approach it any further. In practice, we suggest that users of AutoTune select their update rate from a relatively safe region between 0.05 and 0.1.

7.3.2 Performance.
Fig. 9 compares the face identification results of AutoTune and the competing approaches. Face identification is a multi-class classification task that aims to find an unknown person in a dataset of POI. As we can see, by using the face database developed by AutoTune, the identification accuracy quickly saturates and reaches zero errors within three guesses (rank-3) on both datasets. For the UK dataset, where there are 20 POI, the rank-1 accuracy of AutoTune is as high as 95.8% and outperforms the best competing approach (OA) by 12.5%. The advantage of AutoTune is more significant on the CHN dataset, where it surpasses the best competing approach OA by ∼ 19% (98.0% vs 79.1%). In addition, although D-AutoTune's performance is inferior to AutoTune, it is still more accurate than the competing approaches, especially on the UK dataset. We note that these results are consistent with the face labeling results in §7.2.2. Overall, face identification by AutoTune is highly accurate, considering that AutoTune is only supervised by the weak and noisy device presence information.
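The rank-k identification accuracy behind a CMC curve can be computed as in the following sketch. It assumes each probe image comes with a list of candidate identities sorted from most to least likely; the function name `cmc_curve` and the toy rankings are hypothetical and are not taken from the evaluation code.

```python
def cmc_curve(ranked_predictions, true_labels, max_rank):
    """Fraction of probes whose true identity is within the top-k guesses, k = 1..max_rank."""
    hits = [0] * max_rank
    for ranking, truth in zip(ranked_predictions, true_labels):
        top = ranking[:max_rank]
        if truth in top:
            first = top.index(truth)  # 0-based position of the correct identity
            for k in range(first, max_rank):
                hits[k] += 1          # a hit at rank r counts for every rank >= r
    n = len(true_labels)
    return [h / n for h in hits]


# Toy example with three identities and four probe images.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "a", "b"],
    ["a", "c", "b"],
]
truth = ["a", "a", "b", "a"]
curve = cmc_curve(rankings, truth, max_rank=3)
```

Here `curve[0]` is the rank-1 accuracy and `curve[2]` the rank-3 accuracy; a curve that reaches 1.0 at rank 3 corresponds to "no errors within three guesses".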

Sensitivity Analysis
To further study the performance of AutoTune under different noise conditions, we also conduct an extensive sensitivity analysis via simulation, considering three types of noise that are common in real-world settings:
• False-alarm Faces is the case where the number of detected faces of POI is greater than the number of distinct sniffed MAC addresses.
• False-alarm Devices is the case where the number of detected faces of POI is smaller than the number of distinct sniffed MAC addresses.
• Non-POI Disturbance is the case where there are detected faces that belong to a non-POI, whose MAC address is not in our database, i.e., they should be discarded as outliers.
Figure 10: Results on simulation data showing the impact of sources of noise: (a) False-alarm Faces, (b) False-alarm Devices, (c) Non-POI Disturbance.
On average, each subject in our simulation has 168 images after data augmentation [12]. As there is no actual device presence, we simulate this by first randomly placing the face images into different "events", in which multiple subjects "attend". We then assign to these "events" noisy identity observations to simulate device presence information. Based on different types of noise, different synthetic datasets are generated, on which AutoTune is examined. As the number of images per subject might be skewed across events, we adopt the F1 score as the metric to evaluate AutoTune's performance.
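The event synthesis described above can be sketched as follows. This is only one plausible reading of the procedure: the attendance probability, the number of images drawn per attendee, the function name and the toy image lists are all hypothetical assumptions, since the text does not specify them.

```python
import random


def synthesize_events(subject_images, n_events, attend_prob, imgs_per_event, seed=0):
    """Randomly place face images into events that several subjects 'attend'."""
    rng = random.Random(seed)
    events = []
    for _ in range(n_events):
        # Each subject independently attends the event with probability attend_prob.
        attendees = [s for s in subject_images if rng.random() < attend_prob]
        faces = []
        for s in attendees:
            k = min(imgs_per_event, len(subject_images[s]))
            # Draw k distinct images of this attendee for the event.
            faces.extend((s, img) for img in rng.sample(subject_images[s], k))
        events.append({"attendees": set(attendees), "faces": faces})
    return events


# Hypothetical corpus: 4 subjects with 10 image names each.
corpus = {f"s{i}": [f"s{i}_{j}.jpg" for j in range(10)] for i in range(4)}
events = synthesize_events(corpus, n_events=6, attend_prob=0.5, imgs_per_event=3)
```

The `attendees` set of each event is the clean ground truth from which noisy device observations would then be derived.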

False-alarm Faces.
WiFi sniffing is an effective approach to detect the presence of users' mobile devices. However, such detection is not guaranteed to perfectly match the user's presence. For instance, a device could be forgotten at home, be out of battery, or simply have its WiFi function turned off. In the simulation, we vary the error rate of such false-alarm faces from 0.1 to 0.5. An error rate of 0.1 means that, on average, 10% of the detected faces in each event are false detections. Fig. 10a shows the F1 score and the number of convergence iterations of AutoTune under different levels of such noise. As we can see, AutoTune tolerates false-alarm faces well and keeps the F1 score above 0.83 when the false-alarm rate is below 0.4, though it degrades to 0.67 when the rate rises to 0.5. However, we found that such a case, in which on average half of the WiFi MAC addresses are missed in all meetings, is rare in the real world. Finally, we found that false-alarm faces do not affect convergence: AutoTune converges within 4 iterations in all cases.

False-alarm Devices.
Though surveillance cameras are becoming increasingly ubiquitous, there are cases where subjects are not captured by the cameras, e.g., due to occlusions. This becomes a device false alarm if the subject's device MAC address is still sniffed. We vary the rate of such false-alarm devices from 0.1 to 0.5, where 0.1 means that, on average, 10% of the detected devices in each event are false detections. Fig. 10b shows that although the F1 score of AutoTune decreases, it degrades slowly and is still 0.84 when the false-alarm rate reaches 0.5. As the injected noise becomes stronger, AutoTune needs more iterations to converge. However, the largest number of iterations needed to converge is still below 7 (at a rate of 0.4). Overall, AutoTune is very robust to this type of noise.
7.4.4 Non-POI Disturbance.
Non-POI disturbance happens when subjects without registered MAC addresses are captured by the camera. We found that such noise dominates among the three types of errors. We vary the number of non-POI from 2 to 10, and the probability of each non-POI's presence in an event is set to 0.1. Fig. 10c shows that AutoTune does not suffer much from mild disturbance (2 non-POI), and the F1 score drops slowly to 0.87 with larger disturbance (4 non-POI). In addition, in all cases AutoTune converges within 5 iterations.
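The three noise types studied in this sensitivity analysis can be injected into a clean event as in the sketch below. It is an illustrative approximation rather than the simulator used for Fig. 10: each attendee's device or face is dropped independently with a given rate, and each non-POI appears with a fixed probability (0.1 in the text). The function and parameter names are hypothetical.

```python
import random


def observe_event(attendees, non_poi, miss_device_rate, miss_face_rate,
                  non_poi_prob, seed=0):
    """Return noisy (faces, devices) observations for one event."""
    rng = random.Random(seed)
    # A missed device leaves a face without a matching MAC: a false-alarm face.
    devices = {p for p in sorted(attendees) if rng.random() >= miss_device_rate}
    # A missed face leaves a MAC without a matching face: a false-alarm device.
    faces = {p for p in sorted(attendees) if rng.random() >= miss_face_rate}
    # Non-POI contribute faces only, since their MAC addresses are not registered.
    faces |= {v for v in sorted(non_poi) if rng.random() < non_poi_prob}
    return faces, devices


attendees = {"alice", "bob", "carol"}
visitors = {"v1", "v2"}

clean = observe_event(attendees, visitors, 0.0, 0.0, 0.0)
no_devices = observe_event(attendees, visitors, 1.0, 0.0, 0.0)
crowded = observe_event(attendees, visitors, 0.0, 0.0, 1.0)
```

Sweeping `miss_device_rate` and `miss_face_rate` from 0.1 to 0.5, and the size of `non_poi` from 2 to 10, would mimic the ranges explored above.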

DISCUSSION AND FUTURE WORK
This section discusses some important issues related to AutoTune.
Overheads: Compared with conventional face recognition methods, AutoTune incurs overheads due to FaceNet fine-tuning. In our experiments, fine-tuning is run on one NVIDIA K80 GPU. One fine-tuning round takes around 1.2 hours for the UK dataset and 0.5 hours for the CHN dataset. As we discuss in §7.2, AutoTune converges within 5 iterations most of the time, depending on the hyper-parameter setting. Therefore, the fine-tuning overhead can be kept within 7 hours for the UK dataset and 2.5 hours for the CHN dataset. Compared to FaceNet pre-training, which takes days and requires more GPUs, the fine-tuning cost of AutoTune is much lower. In future work, we will look into an online version of AutoTune, which can incrementally fine-tune the network on the fly as data streams in.
Privacy: In practice, AutoTune requires the face images and device IDs of users to operate, which may have an impact on user privacy. For example, a user could be identified without explicit consent in a new environment if the owner has access to a face image of this user. In this work, we do not explicitly study the attack model in this context, but we note that potential privacy concerns are worth exploring in future work.

CONCLUSION
In this work, we described AutoTune, a novel pipeline to simultaneously label face images in the wild and adapt a pre-trained deep neural network to recognize the faces of users in new environments. The key insight that motivates it is that the enrolment effort of face labelling is unnecessary if a building owner has access to a wireless identifier, e.g., a smartphone's MAC address. By learning and refining the noisy and weak association between a user's smartphone and facial images, AutoTune can fine-tune a deep neural network to tailor it to the environment, users, and conditions of a particular camera or set of cameras. In particular, a novel soft-association technique is proposed to prevent erroneous decisions taken early in the training process from corrupting the clusters. Extensive experimental results demonstrate the ability of AutoTune to design an environment-specific, continually evolving facial recognition system with no user effort at all, even if the face and WiFi observations are very noisy.