1 Introduction

Identity theft, the stealing of a person’s identity, has long been one of the most profitable crimes. In its traditional form [1], criminals commit it either by killing the victim and impersonating that person, or by stealing confidential information from garbage, for example from discarded letters, financial records, electricity bills and other bills that are thrown away without being shredded properly [2]. The term “phishing” is derived by analogy with “fishing” for victims’ passwords and credentials on the web. The “ph” comes from “phone phreaking”, a very common technique for attacking telephone systems during the 1970s. The word “phishing” was first used on the Internet in 1996 by a group of hackers who stole America Online (AOL) accounts by tricking unaware AOL users into disclosing their passwords [1].

Phishing can be described as automated identity theft that takes advantage of human nature and the Internet to trick millions of people and steal large amounts of money. The IT industry research group Gartner reported in April 2004 that about 1.8 million Americans had already given their information to phishers. In the last few years, phishing attacks have grown rapidly and now pose a real threat to global security. These campaigns aim to exploit vulnerabilities in the system, which may be technical or may stem from user unawareness, so researchers have to provide defenses against these attacks at both the technical and the user level. Researchers have tried to achieve the former by employing various approaches, while the latter can be achieved by increasing awareness and educating Internet users. Phishing campaigns attempt to extract secret data from the victims, which may lead to substantial financial losses. Studies have shown that one-third of all phishing attempts in 2013 targeted bank accounts or other financial information [3]. In 2012, financial phishing attacks increased by 8.5 % compared to 2011, reaching an all-time high [4] (Fig. 1).

Fig. 1 Phishing attack year over year (according to RSA Online Fraud Report, January 2014)

Despite causing severe financial damage to users across the Internet, spam and phishing are still growing rapidly, and they will continue to do so as long as even 1 out of 100,000 recipients actually responds to phrases like “Click here” in spam emails. According to Anti-Phishing Working Group (APWG) reports, phishing scams will keep growing with the use of more advanced technologies and will become the main threat on the Internet, surpassing spam, as phishing scams are increasing by 56 % per month [5].

In this paper, we present an overview of phishing attacks and of the many possible defense schemes. The survey gives a broad classification of defense mechanisms, and we provide a set of features used for phishing detection, ranked according to their ability to classify phishing emails effectively. We also provide a taxonomy of the various solutions proposed in the literature to detect and defend against phishing attacks. In addition, we discuss various issues and challenges in dealing with phishing attacks, and we summarize the tools and datasets used by researchers to evaluate their approaches.

The rest of the paper is organized as follows: Sect. 2 presents the history, background and statistics of phishing attacks. Section 3 describes the motivation behind phishing and the phishing life cycle. Section 4 presents performance evaluation metrics for judging anti-phishing systems. Section 5 describes the datasets and tools used for evaluation. Section 6 presents a taxonomy of the various types of phishing attacks. Section 7 provides a taxonomy of the various types of phishing defense mechanisms. Section 8 presents open issues and challenges in phishing detection, and finally, Sect. 9 concludes the paper.

2 History, background and statistics

The word “phishing” was first used on the Internet in 1996 by a group of hackers who stole America Online (AOL) accounts by tricking unaware AOL users into giving up their passwords [1]. Table 1 shows the growth of phishing from 1996 to 2014. According to the APWG report [6], the total number of unique phishing websites detected was 125,215 in the first quarter of 2014, an increase of approximately 11 % over the last quarter of 2013 [7]. This is the second highest number recorded in a quarter; the highest was 164,032 in Q1 of 2012. The USA remains the most targeted country for these attacks [6]. Most phishing campaigns used maliciously registered domains and subdomains. The number of domains increased from 260 million in April 2013 to 272 million in November 2013 [7, 8]. The attacks targeted 82,163 unique domain names, significantly more than the 53,685 targeted during the first half of 2013. Out of the 22,831 registered fraud domains, 1541 used well-known brand names. Rather than using domain names, some attacks used IP addresses, and statistics showed that about 2400 such attacks used around 840 IP addresses [9].

Table 1 Evolution of phishing during 1996–2014

According to a survey in 2013 [10], 62 % of organizations were found to be victims of spear phishing, whereas the survey by InfoSecurity [11] showed that 42 % of organizations had faced these attacks. Overall, 20 % (18 % in RSA and 32 % in InfoSecurity) said that they had not faced such attacks, and 21 % did not know whether they had. Organizations with more than 1000 employees have a higher probability of becoming victims of spear phishing [10, 12].

The United States Computer Emergency Readiness Team (US CERT) collected security incident reports from federal, state and local government agencies and processed 107,655 incident reports in 2011, with 43,889 of them involving federal agencies. After processing these reports, it found that more than half of them (approx. 51.2 %) involved phishing (as shown in Fig. 2). Phishing is therefore, by a wide margin, the most popular means for hackers to get a foot in the door of a government network [5].

Fig. 2 a Summary of total incidents reported to US CERT in FY 2011 and b summary of total incidents (%) reported to US CERT in FY 2011

We also studied statistics of phishing attacks using the eCrime Trend Reports. According to [13], in the fourth quarter of 2013, .com was the most used domain for phishing attacks with 41 %, followed by .net with 6 %, .org with 5 % and .br with 4 %; IP address-based attacks accounted for a further 4 % (as shown in Fig. 3).

Fig. 3 Statistics of phishing websites based on domain (E-crime report 2013 Q4)
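A breakdown like the one in Fig. 3 can be reproduced from any list of reported phishing URLs. The following Python sketch is illustrative only: the function names and the sample URLs are hypothetical, and it assumes that the last label of the host name is the top-level domain, which ignores multi-label public suffixes.

```python
import ipaddress
from collections import Counter
from urllib.parse import urlsplit


def host_category(url: str) -> str:
    """Return 'ip' for IP-based URLs, otherwise the last host label (e.g. '.com')."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return "ip"
    except ValueError:
        pass  # not an IP literal, so treat it as a domain name
    if "." not in host:
        return "unknown"
    return "." + host.rsplit(".", 1)[-1]


def category_shares(urls):
    """Map each category to its percentage share of the given URLs."""
    counts = Counter(host_category(u) for u in urls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: 100.0 * n / total for cat, n in counts.items()}


# Hypothetical sample; a real study would use a feed of reported phishing URLs.
sample_urls = [
    "http://login.example.com/verify",
    "http://secure.example.com/update",
    "http://example.net/account",
    "http://example.org/signin",
    "http://192.0.2.10/bank/login.html",
]
shares = category_shares(sample_urls)
```

Here `shares` maps each category (a top-level domain or `ip`) to its percentage of the sample, which is the quantity plotted in such domain statistics.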

We also found that the USA is the most popular country for hosting phishing websites, with 45 %. The next most popular hosting countries are Germany with 6 %, Canada with 3 %, France with 5 %, the UK with 4 %, Brazil with 3 %, Russia with 2 % and Poland with 2 % (as shown in Fig. 4).

Fig. 4 Statistics of phishing websites based on host countries

According to APWG’s recent report on phishing scams from January to September 2015 [14], Business Email Compromise (BEC) was dominant; these attacks use spear phishing to trick big organizations or specific employees. The global rate of infected computers was found to be 36.51 % in the first quarter, 32.21 % in the second and 32.12 % in the third. During the first three quarters of 2015, Internet Service Providers (ISPs) were the most targeted sector.

3 Motivation and phishing life cycle

3.1 Motivation

Phishers take advantage of the human tendency to ignore critical warning messages. Lack of public awareness about phishing is another main reason why these attacks have been so successful. Whenever researchers come up with a technique to prevent these attacks, phishers try to find a loophole in it in order to attack successfully. Although phishing is mainly used for financial gain, other factors also motivate phishers to commit the crime. The motivations behind these activities are as follows:

  • Theft of login credentials: The phisher steals the user’s login credentials for online services such as eBay, Amazon and Gmail by sending a spoofed email that poses as a warning to change the password and provides a hyperlink for doing so.

  • Theft of banking credentials: The phisher steals online login credentials and credit card details, such as card number, expiry and issue dates, cardholder’s name and CCV number, for popular banking organizations like PayPal, OnlineSBI, HDFC and Citibank.

  • Capture of personal information: Personal information, such as address and telephone number, is highly saleable and in constant demand by direct marketing companies.

  • Theft of trade secrets and confidential documents: With spear phishing techniques, phishers target specific organizations to acquire proprietary information, which is either used directly or sold to interested parties.

  • Fame and notoriety: A very interesting psychological aspect of phishing, in which information is phished not for financial gain but mainly to gain recognition and notoriety among peers.

  • Exploit security holes: People who are curious about how robust a particular system is may write programs to break into somebody else’s system, either to launch phishing attacks or to sell the compromised systems to other phishers.

  • Attack propagation: Through a mixture of spear phishing and bot agent installations, phishers can use a single compromised host as an internal “jump point” within the organization for future attacks.

3.2 Phishing life cycle

A phishing attack involves the following stages, as shown in Fig. 5.

Fig. 5 Phishing life cycle

Stage 1: Planning and setup. In the first step, the attackers identify the target organization, individual or nation. Their next task is to get details about the organization and its network, which can be done by visiting the place physically or by monitoring the traffic going in and out of the network. They then set up the attack using a feasible means, e.g., a website or emails containing malicious links that redirect the victim to a fraudulent web page.

Stage 2: Phishing. The next step is to send spoofed emails, e.g., masquerading as some reputed banking organization, to the victims using the collected email addresses. The emails ask the user to update some information urgently by clicking on a malicious link. They may be sent to individuals in general or to a specific person in an organization.

Stage 3: Break-in/infiltration. As soon as the victim opens the fraudulent link, either malware is installed on the system, which allows the attacker to intrude into the system and change its configuration, or access rights are changed accordingly. In other cases, the link leads to a fake page that asks for credentials.

Stage 4: Data collection. Once the attackers get access to the user’s system, they extract the required data. If the user gives his/her account details to the attackers, they can access that account, which may lead to financial losses for the victim. In the case of malware attacks, the attacker may gain remote access to the system and extract whatever data he wants, or the compromised systems may be used for DDoS attacks, etc. Phishers use rootkits to hide their malware.

Stage 5: Break-out/exfiltration. After getting the required information, the phisher removes all the evidence, i.e., the fake websites and accounts. It has also been observed that phishers track the degree of success of their attacks in order to refine future attacks.

4 Performance evaluation metrics

Most classifiers perform binary classification, i.e., they assign each email to either the phishing or the legitimate (ham) category, so four outcomes are possible. Assume that N_H denotes the total number of ham emails and N_P denotes the total number of phishing emails. Then (n_h → H) denotes the number of ham emails classified as ham, (n_p → H) the number of phishing emails classified as ham, (n_h → P) the number of ham emails classified as phishing and (n_p → P) the number of phishing emails classified as phishing. The evaluation metrics used in this case are [15, 16]:

  1. True positive (TP): The ratio of phishing emails correctly identified as phishing:

    $$ {\text{TP}} = \frac{{n_{\text{p}} \to P}}{{N_{\text{P}} }} . $$
    (1)

  2. True negative (TN): The ratio of ham emails correctly identified as ham:

    $$ {\text{TN}} = \frac{{n_{\text{h}} \to H}}{{N_{\text{H}} }}. $$
    (2)

  3. False positive (FP): The ratio of ham emails classified as phishing:

    $$ {\text{FP}} = \frac{{n_{\text{h}} \to P}}{{N_{\text{H}} }}. $$
    (3)

  4. False negative (FN): The ratio of phishing emails classified as ham:

    $$ {\text{FN}} = \frac{{n_{\text{p}} \to H}}{{N_{\text{P}} }}. $$
    (4)
  5. Precision (p): Measures the fraction of emails detected as phishing that are actually phishing:

    $$ p = \frac{{n_{\text{p}} \to P}}{{n_{\text{p}} \to P + n_{\text{h}} \to P}}. $$
    (5)
  6. Recall (r): Measures the fraction of existing phishing emails that are correctly identified:

    $$ r = \frac{{n_{\text{p}} \to P}}{{n_{\text{p}} \to P + n_{\text{p}} \to H}}. $$
    (6)

  7. f_1 score: The harmonic mean of precision and recall:

    $$ f_{1} = \frac{2 p r}{p + r}. $$
    (7)
  8. Accuracy (ACC): Measures the overall fraction of correctly identified emails:

    $$ {\text{ACC}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}. $$
    (8)

  9. Specificity (S): Measures the fraction of ham emails that are correctly identified:

    $$ S = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}}. $$
    (9)
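As a worked illustration, the metrics can be computed from the four raw outcome counts. The Python sketch below is a minimal example under stated assumptions: the `Counts` container, the `evaluate` function and the sample numbers are hypothetical. It uses the rate-based definitions of TP, TN, FP and FN, takes precision as correctly detected phishing emails over all emails flagged as phishing, and takes specificity as TN over TN plus FP.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Raw outcome counts (hypothetical container, mirroring n_x -> Y)."""
    p_to_p: int  # n_p -> P: phishing classified as phishing
    p_to_h: int  # n_p -> H: phishing classified as ham
    h_to_h: int  # n_h -> H: ham classified as ham
    h_to_p: int  # n_h -> P: ham classified as phishing


def evaluate(c: Counts) -> dict:
    """Compute the rate-based metrics.

    Assumes both classes are non-empty and at least one phishing
    email is detected correctly (otherwise a division by zero occurs).
    """
    n_p = c.p_to_p + c.p_to_h  # N_P: total phishing emails
    n_h = c.h_to_h + c.h_to_p  # N_H: total ham emails
    tp, fn = c.p_to_p / n_p, c.p_to_h / n_p
    tn, fp = c.h_to_h / n_h, c.h_to_p / n_h
    precision = c.p_to_p / (c.p_to_p + c.h_to_p)
    recall = c.p_to_p / (c.p_to_p + c.p_to_h)
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test set: 100 phishing emails and 200 ham emails.
result = evaluate(Counts(p_to_p=90, p_to_h=10, h_to_h=180, h_to_p=20))
```

With these sample counts, recall equals the TP rate (0.9), while precision is lower (90 of the 110 flagged emails) because 20 ham emails are flagged as phishing.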

In some of the studies where specific features are used to identify the category of an email, the evaluation metrics used are [17]:

  1. Entropy (E): Measures the amount of disorder or disturbance in the system. It can be calculated as:

    $$ E\left( S \right) = \sum\limits_{i = 1}^{N} { - p_{i} \log_{2} p_{i} } , $$
    (10)

    where N is the number of classes in the dataset S, and p_i is the probability of an email belonging to class i.

  2. Information gain (IG): Measures the decrease in entropy when a particular feature is used. IG(S, A) is the information gain of dataset S over the attribute A and can be obtained as:

    $$ {\text{IG}}\left( {S,A} \right) = E\left( S \right) - \sum\limits_{{v \in {\text{value}}\left( A \right)}} {\frac{{\left| {S_{\text{v}} } \right|}}{{\left| S \right|}}E} \left( {S_{\text{v}} } \right), $$
    (11)

    where S_v is the subset of S in which attribute A has the value v, |S_v| and |S| are the numbers of emails in S_v and S, and E(S_v) is the entropy of the subset S_v.
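Entropy and information gain can be illustrated on a toy dataset. The Python sketch below is a minimal example, not a reference implementation: the function names, the feature names `has_ip_url` and `has_link`, and the four sample emails are all hypothetical. It weights each subset by its share of the emails, as is standard for information gain.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        # Group the labels by the value the feature takes in each email.
        subsets.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four emails described by two binary features.
rows = [
    {"has_ip_url": True, "has_link": True},
    {"has_ip_url": True, "has_link": False},
    {"has_ip_url": False, "has_link": True},
    {"has_ip_url": False, "has_link": False},
]
labels = ["phishing", "phishing", "ham", "ham"]

gain_ip = information_gain(rows, labels, "has_ip_url")
gain_link = information_gain(rows, labels, "has_link")
```

In this toy example, `has_ip_url` separates the two classes perfectly and removes all of the entropy, whereas `has_link` leaves the class mix unchanged, so a ranking by information gain would place `has_ip_url` first.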

5 Datasets and tools used for evaluation

Many datasets that are commonly used for the experimentation and evaluation of phishing detection algorithms are freely available on the Internet. This section briefly describes the most popular phishing and ham datasets.

5.1 Standard datasets

Phishing archive: The Anti-Phishing Working Group’s “Phishing Archive” is a record of phishing attacks that are either reported to APWG or detected by APWG [18]. This dataset was used by Dhamija et al. [19] and Abburous et al. [20] in their evaluations.

PhishTank: The PhishTank website stores phishing data reported by users. The phishing data are shared via the website and can also be accessed through an API [21].

Corpora: The corpora of the SpamAssassin project contain three parts: spam, easy ham, which is easily distinguishable from spam, and hard ham, which is hard to distinguish from spam [22]. Recent additions to this corpus are easy_ham_2, a ham dataset, and spam_2, a spam dataset. This dataset was used by I. Fette et al. [23] for the evaluation of their algorithm PILFER and by M. Khonji et al. [24] for implementing the LUA algorithm.

Enron dataset: The Enron dataset [25] was collected by the CALO [26] project and covers more than 150 employees. Initially, the dataset had some integrity problems, which were fixed by Bryan Klimt and Yiming Yang [27]. This dataset was used by Georgala et al. [28]. It comprises personal emails. The ham messages are collected from six Enron employees and the TREC 2005 Spam Track public corpus. The dataset contains approximately 50,000 spam and 43,000 ham emails and is considered a benchmark dataset.

TREC: Another commonly used dataset is the TREC corpus [29], used by Al-Daeef et al., whose copyright is held by the University of Waterloo. The TREC 2005 corpus was created for spam evaluation and contains 92,189 emails ordered chronologically, of which 39,399 are ham emails and 52,790 are spam emails. TREC 2006 and 2007 are also available from the TREC websites.

IronPort: IronPort was founded in 2000 by Scott Banister and Scott Weiss to provide defense against Internet threats, and it was acquired by Cisco in 2007. IronPort’s corpus [30] was used by Tyler Moore et al. [31]. The dataset is a collection of messages that arrive at IronPort’s spam traps and of emails submitted by its customers.

IronPort’s SpamCop [32], founded by Julian Haight in 1998 and acquired by IronPort in 2003, is a service that keeps a record of spam reported by the recipients of commercial emails or UBEs (Unsolicited Bulk Emails). It has a number of spam traps at geographically different locations and is thereby a major contributor to the IronPort corpus. SpamCop processes all the reported spam and creates a blacklist of the systems used to send those emails.

Phishload: Phishload [33] is a phishing database created in 2012 by Max-Emanuel Maurer. It contains the HTML code, URLs and other information related to phishing websites, as well as about 1000 legitimate and target websites (Table 2).

Table 2 Dataset description

5.2 Various tools used for evaluation

This section describes various tools used for experiments and for judging the accuracy of an anti-phishing system. A researcher can select a tool depending on the parameters and algorithms used; e.g., if an approach uses a data mining algorithm for phishing detection, then the WEKA tool can be used. Figure 6 shows numerous tools that can be used for the evaluation of phishing detection, and Table 3 gives an overview of these tools and their possible applications in various fields.

Fig. 6 Tools used for evaluation

Table 3 Experimental tools and their applications

6 Taxonomy of phishing attacks

Phishing attacks are broadly classified into two categories: social engineering-based and malware-based phishing attacks. In social engineering-based phishing, the attackers try to acquire the targets’ credentials by using a fake website or by sending fake emails that appear legitimate in order to trick the user [34–36]. Social engineering-based phishing, also known as deceptive phishing, can be further classified as (1) email phishing and (2) website phishing. Malware-based phishing attacks, in turn, use a variety of malicious programs, which are essentially unwanted software running on the target’s system. These attacks can be further classified as key loggers/screen loggers, session hijacking, host file poisoning, DNS phishing and content injection. The classification of phishing attacks is shown in Fig. 7.

Fig. 7 Types of phishing attacks

Some of the mechanisms used to carry out these phishing attacks are summarized in the following paragraphs.

6.1 Phishing using compromised web servers

In this category, the attackers search for vulnerable servers and install a backdoor that lets them access the compromised server. If the server is a web server, phishing websites are downloaded onto it; the sites then start receiving traffic, and victims begin to access the malicious content [37]. The study conducted by Tyler Moore et al. [38] found that 76 % of the examined phishing websites were hosted on a compromised web server.

6.2 Phishing through botnets

A botnet is a network of infected computers controlled by an attacker from a remote location. Botnets can be so large that they pose a serious threat when used for distributed denial-of-service (DDoS) attacks, and attackers these days also use them for sending spam emails and launching phishing attacks [37]. A study by CipherTrust (an email security company) in October 2004 showed that 70 % of recorded phishing mails were sent using one of five active botnets. Other observations also showed that a large number of unrecorded botnets are in use for such attacks. Botnets can generate enough traffic to consume very high bandwidth, and a collection of 20,000 such machines can take down about 90 % of websites [39, 40].

6.3 Phishing through port redirection

This mechanism makes it difficult to trace the source of the attack. Here, no phishing content is uploaded directly. Instead, the attackers use port redirection services to re-route web requests sent to the server to some other remote web server [37]. The tool generally used for port redirection is Fpipe [41].

6.4 Social engineering

Social engineering attacks aim to acquire the victim’s identity or other confidential information through spoofed or fake emails. They are carried out with motives similar to those of hacking, i.e., to gain illegal access to a system, to obtain confidential information about an organization or an individual, to intrude into a network, etc. Common targets include big corporations, the military and government agencies [42]. Social engineering attacks operate at two levels: physical and psychological [43].

The first level is the physical setting where the attack takes place, which may be the workplace, the phone or even online. At the psychological level, the attackers focus on creating a favorable psychological environment in order to make the attack work. Whatever the method used, the primary goal is to make the victim believe that the attacker is a genuine person, so that the victim gives him the required details [44]. Attackers never try to get a lot of information at one time from one user. Instead, they try to get small details from many people in order to gain their trust.

6.4.1 Website phishing

Website phishing is a psychological attack that targets a particular person rather than a system. These attacks are very easy to carry out because creating a phishing website that is an exact replica of a legitimate site is not a problem for the attacker [45]. The main aim is to defraud people in order to obtain their personal and financial details. Moreover, detecting phishing websites is a very complex task, as it is primarily a combination of technical and social problems. These attacks aim to compromise an individual’s or an organization’s confidential information.

6.4.2 Email phishing

The phisher’s first step is to launch a phishing website. The phisher then sends a large volume of fake emails, which may ask the user to click on a link within the email; a user who does so may give away his/her identity or other personal information, which is passed on to the phisher by a phishing server. The phisher then uses the victim’s identity to gain illegal financial benefits or for some other purpose. Although phishing emails have become more polished over time, there are still a number of clues that indicate their deceptive nature. A variant of email phishing that has become increasingly common is spear phishing, in which the type and the targets of the phishing attack are the main concern.

6.4.2.1 Protection from phishing emails

The transportation of a phishing email message is shown in Fig. 8 [12]. To distinguish phishing emails from ham emails in online mode, the detection framework is placed between the message transfer agent (MTA) and the mail user agent (MUA), so that a phishing email is stopped before it reaches the victim’s account.

Fig. 8 Phishing email message transportation [12]

MTA (message transfer agent): Acts as a post office, storing emails and acting as an email carrier.

MUA (mail user agent): A software program used for retrieving emails, such as “Microsoft Outlook”.

MDA (message delivery agent): Acts as a mailbox, which stores messages (as much as its volume will allow) until the recipient checks the box.

Phisher: Malicious user who sends a phishing message to a potential victim.

Victim: A user who may open a phishing email and become the target of a phisher.

An overview of email data parts is shown in Fig. 9 [46].

Fig. 9 An overview of email data parts [46]

6.4.2.2 Spear phishing

Spear phishing may be defined as “highly targeted phishing aimed at specific individuals or groups within an organization” [47]. Spear phishing was first studied in 2005, as context-aware phishing, by Jacobson et al. [43]. A few years later, Jagatic et al. [42] found that the number of users who fall victim to spear phishing is 4.5 times larger than for general phishing attacks.

6.4.2.3 Stages in spear phishing attacks

The spear phishing attacks start with the attacker collecting information about the organization and end with getting access to the required information. Figure 10 shows the steps involved in spear phishing [48].

Fig. 10 Stages in phishing attacks

6.4.2.4 Spear phishing: one of the biggest cyber security threats

Spear phishing has become more common these days. The phisher disguises the email so that it looks as if it came from someone in the same organization. This increases the probability of success, as people are more likely to open an email that appears to come from someone familiar; it also makes these attacks very hard to detect, so the attacker can breach the organization without anyone knowing. The US Department of Defense was once a target of spear phishing, which caused the Joint Task Force-Global Network to educate employees about these kinds of attacks [47, 49].

6.5 Malware-based phishing

Malware is malicious software that is usually installed on a machine without the knowledge of the victim; sometimes the victim is even tricked into downloading what appears to be anti-virus software but is really the virus or malware itself [50, 51]. Malware can access the user’s confidential data and send it to the phisher. It takes advantage of gaps in browser software and operating systems, or uses deceptive techniques to encourage the victim to execute the malicious code.

According to the APWG first quarter report of 2013 [3], around 5 million malware samples were reported by the APWG member company PandaLabs, which raised the set of new malware samples to 15 million. Trojans remain the most commonly used malware, making up 72 % of all malware samples found. About 33 % of systems across the globe are infected, and China has the highest number of infected systems (Table 4).

Table 4 Percentage of all malwares captured in Q1 of 2014 [6]

6.5.1 Key loggers and screen loggers

Key loggers pose a severe threat to systems because people are unable to detect their presence, and screen recording software has made the situation even worse, because it renders virtual keyboards ineffective as a defense against key logging. Key loggers can be categorized as hardware key loggers and software key loggers [52].

  1. Hardware key loggers: Hardware key loggers are devices that record the data entered via the keyboard into their own memory. They do not use any resources of the system, so anti-virus software cannot recognize them. They are also very small and hide well in a system, and they may send keystrokes to an attacker at a remote location [48].

  2. Software key loggers: Software key loggers examine operating system data and the data entered through the keyboard of the victim’s system, and record them so that they can later be sent to the attacker at a remote location. Virtual keyboards help against this software because virtual keyboard data are entered with a mouse [48, 52].

  3. Screen recording software: Screen recording software captures the screen and mouse movements for monitoring purposes. It can also be misused, which makes a virtual keyboard an unsafe option: the software records on-screen activity, including keystrokes entered using a virtual keyboard [48].

6.5.2 Session hijacking

Session hijacks can occur either at the network level or at the application level. At the network level, session hijacking involves interfering with TCP and UDP sessions, whereas at the application level it involves interfering with HTTP sessions [53].

  1. TCP session hijack: In this case, an existing TCP connection between two communicating systems is intercepted. The attacker masquerades as one of them and redirects the TCP traffic toward himself/herself by inserting fake IP packets, so that the commands are processed by the authenticated host. This desynchronizes the session between the actual communicating hosts. Authentication is done only when a connection is established; thus, an already established connection can be hijacked without any authentication (Fig. 11).

    Fig. 11 TCP session hijack

  2. UDP session hijacking: A UDP session is easier to hijack than a TCP session because synchronization and sequencing of packets are not required. In this case, the attacker sends a fake UDP reply to the host on behalf of the server before the server can respond.

  3. Application-level hijacking: At the application level, either a new session is initiated using stolen data or an existing session is hijacked.

HTTP session hijack: The attacker obtains the ID corresponding to a session. The ID is the only unique identifier for an HTTP session and can be extracted from the URL that the browser receives for the HTTP GET request, from cookies on the client’s system or from form fields, as shown in Fig. 12.

Fig. 12 HTTP session hijack
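To make the locations of a session ID concrete, the following Python sketch shows where such an identifier typically sits in a URL and in a cookie header. It is an illustration under stated assumptions: the parameter and cookie name `sessionid`, the helper names and the sample values are hypothetical, and real applications choose their own names.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlsplit

SESSION_KEY = "sessionid"  # hypothetical parameter/cookie name


def session_id_from_url(url):
    """Return the session ID carried in the URL query string, if any."""
    values = parse_qs(urlsplit(url).query).get(SESSION_KEY)
    return values[0] if values else None


def session_id_from_cookie(cookie_header):
    """Return the session ID carried in a Cookie header value, if any."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get(SESSION_KEY)
    return morsel.value if morsel is not None else None


# Hypothetical sample values.
url_id = session_id_from_url("http://shop.example.com/cart?sessionid=abc123&item=42")
cookie_id = session_id_from_cookie("sessionid=abc123; theme=dark")
```

Because the identifier is the only thing that ties a request to a session, anyone who can read it from either location can present it as their own, which is why an ID carried in a URL (and therefore in shared links and logs) is especially exposed.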

  4. Session hijacking in WLANs: Session hijacking has become a serious threat to WLANs, as it exploits vulnerabilities of the network and can be carried out using commercial off-the-shelf tools. The attacker forces a legitimate station to disconnect from its access point and then, using that station’s address, connects to the access point in its place [54].

6.5.3 Host file poisoning

Hosts file poisoning refers to injecting new entries for websites into a machine’s hosts file, so that requests for those websites are redirected to another site. When a client enters a URL, the host name is converted into an IP address before the request is sent over the Internet; by poisoning the hosts file, hackers cause a bogus address to be used, redirecting the user to a fraudulent website where they are asked to give their personal information [55].
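On the defensive side, a poisoned hosts file can be spotted by checking whether it contains entries for sensitive domains, which are normally resolved through DNS rather than listed locally. The Python sketch below is a minimal illustration: the helper names, the watched domain `www.example-bank.test` and the sample file content are hypothetical, and treating every local entry for a watched domain as suspicious is an assumption.

```python
def parse_hosts(text):
    """Yield (ip, hostname) pairs from hosts-file text, ignoring comments."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            yield ip, name.lower()


def suspicious_entries(text, watched_domains):
    """Return hosts-file entries that pin a watched domain to a fixed address."""
    watched = {d.lower() for d in watched_domains}
    return [(ip, name) for ip, name in parse_hosts(text) if name in watched]


# Hypothetical hosts-file content with one injected entry.
sample_hosts = """
127.0.0.1    localhost
# entry below was added by malware
203.0.113.5  www.example-bank.test
"""
flagged = suspicious_entries(sample_hosts, ["www.example-bank.test"])
```

On a real system the text would be read from the operating system’s hosts file, and the watched list would contain the banking and payment sites the user actually visits.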

6.5.4 DNS phishing

In DNS phishing attacks, the attacker first sets up a rogue access point, on which he runs a fake DNS server, and tempts the client to connect to it. Using this fake DNS server, requests for particular sites are redirected to the attacker’s phishing server [56]. Figure 13 shows the mechanism of a DNS phishing attack.

Fig. 13 DNS phishing attacks mechanism

6.5.5 System reconfiguration attacks

In a system reconfiguration attack, the user receives a message from the attacker asking him or her to reconfigure the computer settings. The message may come from a web address that appears to be a reliable source [57].

System reconfiguration attacks can be categorized as:

  1. Pharming: Also referred to as a host name lookup attack, pharming interferes with the hosts file of the victim’s system and modifies the target address so that it points to some other, malicious location.

  2. Proxy attack: Another type of reconfiguration attack is the installation of a proxy through which the Internet traffic between the victim and the server passes, so that the attacker can extract confidential information from it.

6.5.6 Content injection

These attacks can take place when an application does not properly handle data supplied by the user: an attacker can supply content to the application through a parameter value, which results in a modified page that appears to come from a trusted domain. The attacker first identifies a vulnerable parameter and then crafts a link by making a small change to a valid request; this link is then sent to the user [58].
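A minimal sketch of such a vulnerable parameter is given below, assuming a hypothetical page that echoes a `msg` query parameter. The domain names, the parameter name and both rendering functions are illustrative assumptions; the second function shows one common mitigation (escaping the value before output), which is not discussed in [58].

```python
import html
from urllib.parse import urlparse, parse_qs

def _msg_param(url):
    """Read the (hypothetical) 'msg' query parameter from a request URL."""
    return parse_qs(urlparse(url).query).get("msg", [""])[0]

def render_unsafe(url):
    # Vulnerable: the parameter value is placed into the page unchanged.
    return "<p>" + _msg_param(url) + "</p>"

def render_escaped(url):
    # Mitigation: markup characters in the value are escaped before output.
    return "<p>" + html.escape(_msg_param(url)) + "</p>"

# A crafted link: a valid request whose parameter now carries a fake login form.
crafted = ("https://trusted.example/notice?msg="
           "%3Cform%20action%3D%22https%3A%2F%2Fevil.example%2F%22%3ELogin%3C%2Fform%3E")

print(render_unsafe(crafted))
print(render_escaped(crafted))
```

In the first case the injected form is served from the trusted domain; in the second it is displayed as inert text.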

6.5.7 Phishing through search engines

Phishing attacks involving search engines direct the user to online shopping sites where products or services are offered at a low cost, which may tempt the user to buy by giving credit card details at the phishing site. There are also many fraudulent websites that offer credit cards or loans to users at low rates. Attackers can make use of spam indexing (spamdexing), a technique that manipulates indexing in search engines so as to raise the rank of the malicious web page and fool the user. It can be done by inserting keywords that make the web page seem relevant or legitimate [59]; the steps followed are shown in Fig. 14 (Table 5).

Fig. 14 Spamdexing stages

Table 5 Overview of malware attacks

7 Taxonomy of phishing defense mechanisms

A phishing email is a kind of spam mail: a criminal mechanism that relies on fake emails claiming to come from a legitimate organization. The basic objective is to steal personal or confidential information from the victims. In this section, we discuss various approaches to filter phishing emails and detect malicious web pages.

7.1 Features used for identification of phishing scams

Toolan and Carthy [60] studied the utility of about 40 features and evaluated their effectiveness using information gain and entropy. They categorized the common features used for detection of phishing emails as:

  1. Body-based features: These features are extracted from the email body. They include binary features such as the presence of forms, HTML, or certain phrases and links in the email body.

  2. Subject-based features: These features are extracted from the subject of an email, such as whether it is a reply to a previous mail or contains certain words like verify or debit.

  3. URL-based features: These features check whether an IP address is used instead of a domain name, the presence of @ in the links, the number of images and of external and internal links in the email text, the count of periods in the links, etc.

  4. Script-based features: These features check for the presence of JavaScript, pop-up window code, onClick events, etc., in the email.

  5. Sender-based features: These features include the sender’s details, such as a difference between the sender’s address and the reply-to address.
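As a minimal sketch, a few features from each of the five categories above can be extracted with the Python standard library. The feature names, the keyword list and the sample message are assumptions made for illustration; they are not the exact feature set evaluated in [60].

```python
import re
from email import message_from_string
from email.utils import parseaddr
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
IP_HOST_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def extract_features(raw_email, keywords=("verify", "debit")):
    """Return a dict of simple body, subject, URL, script and sender features."""
    msg = message_from_string(raw_email)
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart: join the text of the parts
        body = "\n".join(str(part.get_payload()) for part in body)
    lower_body = body.lower()
    subject = (msg.get("Subject") or "").lower()
    urls = URL_RE.findall(body)
    hosts = [urlparse(u).hostname or "" for u in urls]
    sender = parseaddr(msg.get("From", ""))[1].lower()
    reply_to = parseaddr(msg.get("Reply-To", ""))[1].lower()
    return {
        "body_has_form": "<form" in lower_body,
        "body_has_html": bool(re.search(r"<\s*(html|a|table|div)\b", lower_body)),
        "subject_is_reply": subject.startswith("re:"),
        "subject_has_keyword": any(k in subject for k in keywords),
        "url_count": len(urls),
        "url_has_ip_host": any(bool(IP_HOST_RE.match(h)) for h in hosts),
        "url_has_at": any("@" in u for u in urls),
        "url_max_periods": max((u.count(".") for u in urls), default=0),
        "has_script": "<script" in lower_body or "onclick" in lower_body,
        "reply_to_differs": bool(reply_to) and reply_to != sender,
    }

# Hypothetical message used only to exercise the extractor.
sample_email = "\n".join([
    "From: Support <support@examplebank.example>",
    "Reply-To: collector@mail.example.net",
    "Subject: Please verify your account",
    "",
    '<html><body><a href="http://192.0.2.10/login.php" onClick="go()">Sign in</a></body></html>',
])
features = extract_features(sample_email)
print(features)
```

The resulting dictionary can be turned into a feature vector for any of the classifiers discussed later in this section.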

The most effective features are extracted by analyzing data parts C and D of an email message (as shown in Fig. 15). A common approach in extracting features found in data parts A and B is to use a blacklist, which is not efficient as blacklists perform poorly against zero-day phishing attacks [46]. The groups of most effective features of an email are discussed in Table 6.

Fig. 15 Feature types for phishing detection

Table 6 Most effective features of an email

Table 6 shows four groups of features: external features (group 1), body-based features (group 2), URL-based features (group 3) and header features (group 4). Phishing emails are a traditional and common vehicle for phishing fraud. A phishing mail passes from a Mail User Agent (MUA) to a Mail Transfer Agent (MTA), which transfers it to a Mail Delivery Agent (MDA), from which it is finally received. Figure 8 as used in [61] shows the procedure by which a phishing email is transferred through a computer network. These features are also ranked in [17] by information gain, giving the overall best, worst and median features listed in Table 7.
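Ranking by information gain can be sketched as follows, using the standard definition: the entropy of the class labels minus the expected entropy after splitting on the feature. The toy labels and feature columns are assumptions for illustration and do not reproduce the data of [17] or [60].

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * log2(p)
    return h

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: four emails and two binary features.
labels = ["phishing", "phishing", "ham", "ham"]
has_ip_url = [1, 1, 0, 0]   # separates the two classes perfectly
is_reply = [1, 0, 1, 0]     # carries no information about the class

ranking = sorted(
    {"has_ip_url": information_gain(has_ip_url, labels),
     "is_reply": information_gain(is_reply, labels)}.items(),
    key=lambda item: -item[1],
)
print(ranking)
```

A feature that splits the emails into pure groups obtains the maximum gain, while a feature that is independent of the class obtains a gain of zero.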

Table 7 Classification of features according to their efficiency

7.2 Classification of protection against phishing attacks

Classification of different protection mechanisms against phishing attack is shown in Fig. 16.

Fig. 16 Taxonomy of phishing detection approaches [4, 122]

7.2.1 User education

User education refers to spreading awareness and education about phishing among Internet users. Education-based approaches offer online information about risk of such attacks and their prevention techniques [62]. Some approaches also provide online training and testing to the users.

7.2.1.1 User’s response to phishing attacks

Downs et al. [63] performed a study and concluded that users should be educated about phishing rather than merely warned about the negative consequences of attacks. In this study, 232 computer users were asked to view some emails and answer a few questions related to them. The study shows that users with knowledge of URLs and locks are less likely to fall for phishing attacks, whereas understanding of other web concepts, e.g., cookies and malicious software, did not reduce the likelihood of falling for phishing attacks. Huang et al. [64] found that users fall for such attacks either because they are unable to differentiate between a legitimate and a phishing website or because they ignore warnings and toolbar indicators.

Shen et al. [65] reported some indirect characteristics, e.g., that females are more likely to fall victim to phishing attacks than males. The same is true for people between 18 and 25 years of age, due to a lack of awareness and technical knowledge. Don et al. [66] proposed a model that describes user interaction with respect to decision making, which starts as soon as the user views the phishing mail or web page and stops when the user ends their activities. The goal is to detect phishing campaigns by understanding the way users react to phishing web pages or mails.

7.2.1.2 Prevention of attacks by warning the user

Whenever a user clicks on a malicious link or views a phishing website, web browsers issue security warnings that can be of two types: (1) active warnings, which block malicious content and prevent the user from viewing it, and (2) passive warnings, which display a pop-up window warning the user while the content is being viewed. Studies show that active warnings are more effective than passive ones, as users tend to ignore warnings unless they are blocked from viewing the content [67].

Moore and Clayton [20] stated that service takedown is the most common method used by service providers to handle security problems. Service providers are also required to strictly enforce rules against illegal use of the services they provide. Egelman et al. [68] showed that passive warnings are ineffective, as only about 13 % of participants noticed passive warnings given by the browser, while active warnings were noticed by 79 % of the participants. This explains the failure of security toolbars, as they mostly rely on passive warnings. The study conducted by Kumaraguru et al. [69] shows that periodic security warnings are ineffective, as they can enhance users’ knowledge but cannot force them to change their attitude. They also proposed a design for a new method of sending educational notices to users frequently.

Arachchilage and Cole [70, 71] developed a mobile game to familiarize Internet users with phishing attacks, using Technology Threat Avoidance Theory (TTAT) to design the game. The main objective is to enhance user knowledge and reduce users’ ignorance of such threats. In [72], another TTAT-based model is presented, showing that a well-designed user education program is helpful in preventing attacks (Table 8).

Table 8 Overview of defense against phishing techniques

7.2.1.3 Online training

Training web users to identify malicious emails and websites can be very effective in preventing phishing attacks. Over the last few years, various methods have been proposed to train users online, and training and testing users through games has also proved to be effective. Kumaraguru et al. [69] proposed a design of a new method in which educational notices are sent to users at fixed intervals. They also proposed a training method embedded into the daily tasks of the user so that he/she is not required to read any outside sources. Their study showed that embedded training is more successful than educational notices: 89 % of the users trained through educational notices fell for phishing attacks, whereas only 30 % of the users who had training messages included in their daily tasks became victims of such attacks. A limitation of this approach is that it requires an administrator to handle the messages, who must be aware of all the latest aspects of phishing.

7.2.2 Software-based defense approaches

7.2.2.1 Protection at network level

In this approach, a certain range of IP addresses or a set of domains is not allowed to enter the network. DNS-based blacklists (DNSBLs) [73] make use of the DNS protocol and are created and regularly updated by observing network traffic. The open-source software Snort can also be used at the network level, although these tools require continuous updating.

7.2.2.2 Authentication-based mechanisms

This approach confirms whether or not a message was sent via a valid path and domain name, and it can be employed at both the user and the domain level. These techniques enhance the security of email communication. The authentication schemes are fairly simple and can be applied at the domain level or by digitally signing the document before sending. However, they require the same technology to be used on both the sender’s and the receiver’s side. Another authentication mechanism, transaction authentication numbers, is used by banks. However, these do not ensure security against man-in-the-middle attacks and are costly in terms of time and computation [74].

7.2.2.3 Client-side tools

These include user profile filters and browser-based toolbars. Other techniques are domain checks, URL examination, etc. These tools also depend on blacklisting and whitelisting techniques, where a list of detected phishing or legitimate websites is downloaded and updated at regular intervals. The limitation of these techniques is their failure to detect zero-day attacks.

I. Phishing detection by blacklists and whitelists Blacklists contain URLs and IP addresses that are found to be suspicious and are frequently updated. However, they do not provide protection from zero-day phishing attacks and can detect only about 20 % of these attacks. Studies conclude that 47–83 % of phishing URLs are blacklisted after 12 h. This delay is significant, as 63 % of phishing attacks end within the first 2 h [75]. Some of the approaches making use of blacklists and whitelists are: Google safe browsing API, DNS-based blacklists, predictive blacklisting and automated individual whitelist.

(1) Google safe browsing API Google provides a safe browsing service that allows applications to verify URLs against a list of suspicious pages that is regularly updated by Google. It is an experimental API and is used by Google Chrome and Mozilla Firefox. The Safe Browsing service provides two experimental APIs:

(a) Safe browsing lookup API The Safe Browsing Lookup API [74, 76] allows clients to send suspicious URLs to the Safe Browsing service, which reports whether each URL is legitimate or malicious. The client sends the URLs in a GET or POST request, and they are checked against the malware and phishing lists provided by Google; the current version in use is 3.1. Some of the shortcomings of the Safe Browsing Lookup API are: (i) no hashing is performed before sending URLs and (ii) there is no limit on the response time of the lookup server.

(b) Safe browsing API v3 Using the Safe Browsing API [77], the client can download a table of URLs for client-side lookups. Version v3 of the Safe Browsing API was introduced in 2014, after which version v2 was deprecated. Phishing and malware URLs are published in two different blacklists, googpub-phish-shavar and goog-malware-shavar, both of which store SHA-256 hash values truncated to a 4-byte hash prefix.
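The idea of a client-side lookup against 4-byte SHA-256 prefixes can be sketched as follows. This is a simplified illustration and not the Safe Browsing client protocol: URL canonicalization and the generation of host/path expressions are omitted, and the function names and the listed expression are assumptions.

```python
import hashlib

PREFIX_LEN = 4  # bytes kept from each SHA-256 digest

def hash_prefix(expression):
    """First PREFIX_LEN bytes of the SHA-256 digest of a URL expression."""
    return hashlib.sha256(expression.encode("utf-8")).digest()[:PREFIX_LEN]

def build_local_list(expressions):
    """Stand-in for a downloaded blacklist: a set of hash prefixes."""
    return {hash_prefix(e) for e in expressions}

def maybe_listed(expression, local_prefixes):
    """True means a possible match only; distinct URLs can share a prefix,
    so a hit still has to be confirmed against the full hash."""
    return hash_prefix(expression) in local_prefixes

# Hypothetical blacklist entry.
local_prefixes = build_local_list(["phish.example/login"])
print(maybe_listed("phish.example/login", local_prefixes))
```

Storing short prefixes keeps the downloaded table small and avoids sending plain URLs for every lookup, at the cost of a second confirmation step on a prefix hit.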

(2) DNS-based blacklist (DNSBL) A DNSBL is a DNS zone containing resource records that identify the hosts present in the blacklist. The IP address or domain name of a host is transformed into a name within the DNSBL zone in order to be encoded. Each entry in the DNSBL must have an A record and a TXT record, the latter giving the reason for blacklisting [75]. The standard content of the A record is 127.0.0.2, but it may have other values too. A DNSBL can use the same TXT record for all entries or a different one for each entry.

(a) IPv6 DNSBLs The structure of DNSBLs using IPv6 addresses is defined in [78]. The entry names are IPv6 addresses with the DNSBL domain as their suffix. The A and TXT records are used in the same way as in IPv4 DNSBLs. A single DNSBL can contain both IPv4 and IPv6 addresses. The representation of IPv6 entries is similar to that of IPv4, with the 4-octet address replaced by the 32 nibbles of the IPv6 address.

(b) Domain name DNSBLs Domain names are used less frequently by DNSBLs than IP addresses. The interpretation of the A and TXT records is the same as in IPv4 DNSBLs. The system manager must be cautious when choosing DNSBLs: the management policies of the DNSBL operator and those of the system manager should be consistent. Otherwise, addresses from which the system expects mail might get blocked (Fig. 17).

Fig. 17 DNS-based blacklisting mechanism
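The transformation of an address into a DNSBL entry name can be sketched as below: the octets of an IPv4 address, or the 32 nibbles of an IPv6 address, are reversed and prefixed to the DNSBL domain. The zone name `dnsbl.example` and the function name are assumptions for illustration; no DNS query is issued.

```python
import ipaddress

def dnsbl_query_name(address, zone):
    """Build the name to look up in a DNSBL zone for an IPv4 or IPv6 address."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        labels = str(ip).split(".")[::-1]                 # reversed octets
    else:
        labels = list(ip.exploded.replace(":", ""))[::-1]  # reversed nibbles
    return ".".join(labels) + "." + zone

v4_name = dnsbl_query_name("192.0.2.99", "dnsbl.example")
v6_name = dnsbl_query_name("2001:db8::1", "dnsbl.example")
print(v4_name)  # 99.2.0.192.dnsbl.example
print(v6_name)
```

A resolver would then request the A record for this name; an answer such as 127.0.0.2 indicates that the address is listed, and the accompanying TXT record gives the reason.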

(3) PhishNet: Predictive blacklisting Attackers often apply only simple changes to a URL; PhishNet [79] detects such changes using two components:

(a) Malicious URL prediction: PhishNet examines the blacklisted URLs and uses the following heuristics to create new variations of each URL:

  • Replacing the top-level domain (TLD) with each of 3209 different TLDs, resulting in child URLs that are to be examined.

  • Maintaining clusters of hosts (host equivalence classes) that share the same IP address; all combinations of these hostnames and paths are then used to create new URLs.

  • Grouping URLs with the same directory structure and creating new URLs by exchanging filenames within each group.

  • Swapping the query parts of two URLs that have the same directory structure but different query parts.

  • Substituting the brand names in the phishing URLs.

Once the child URLs are generated, they are subjected to a validation process which eliminates legitimate URLs.

(b) URL matching: Incoming URLs are matched against the blacklisted URLs using regular expressions and hash maps.
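Two of the prediction heuristics, TLD replacement and filename exchange within a shared directory structure, can be sketched as follows. This is an illustrative simplification and not the PhishNet implementation [79]: the TLD list is a small hypothetical sample, multi-label suffixes such as co.uk are not handled, and the validation step is left out.

```python
from itertools import product
from urllib.parse import urlsplit, urlunsplit

def tld_variants(url, tlds):
    """Replace the last host label with every other TLD in the given list."""
    parts = urlsplit(url)
    base, _, old_tld = (parts.hostname or "").rpartition(".")
    if not base:
        return []
    return [
        urlunsplit((parts.scheme, base + "." + tld, parts.path, parts.query, ""))
        for tld in tlds if tld != old_tld
    ]

def filename_swaps(urls):
    """Group URLs by directory path and exchange filenames across the group."""
    groups = {}
    for u in urls:
        p = urlsplit(u)
        directory, _, filename = p.path.rpartition("/")
        sites, names = groups.setdefault(directory, (set(), set()))
        sites.add((p.scheme, p.netloc))
        names.add(filename)
    candidates = set()
    for directory, (sites, names) in groups.items():
        for (scheme, netloc), name in product(sites, names):
            candidates.add(urlunsplit((scheme, netloc, directory + "/" + name, "", "")))
    return sorted(candidates - set(urls))  # keep only URLs not already blacklisted

# Hypothetical blacklisted URLs.
blacklisted = [
    "http://a.example/online/signin/paypal.htm",
    "http://b.example/online/signin/ebay.htm",
]
print(tld_variants("http://login.bad.com/a/b.php?x=1", ["com", "net", "org"]))
print(filename_swaps(blacklisted))
```

Each generated child URL would still have to pass the validation step described above before being added to the blacklist.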

(4) Automated individual whitelist AIWL [80] keeps records of the legitimate login user interfaces (LUIs) of web pages. Whenever a user submits his/her credentials to an LUI, the whitelist is checked, and if the LUI is not on the list, a warning is given to the user.

AIWL has two primary components:

  • Whitelist: It contains a list of legitimate LUIs and is used to check whether a URL is familiar or suspicious, so that warnings are suppressed for familiar LUIs. In the whitelist, each LUI is stored as a vector comprising the URL address, page features, DNS-IP mapping, etc.

  • Automated whitelist maintainer: It is a classifier (naive Bayes) that decides whether to store an LUI in the whitelist. The whitelist maintainer checks the number of logins for a specific LUI; if it exceeds a certain threshold, that LUI is whitelisted.
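The interplay of the two components can be sketched as below. For brevity, the naive Bayes maintainer of AIWL is replaced here by a plain login counter with a threshold, and an LUI is reduced to a (URL, IP) pair; the class name, the threshold value and the sample LUI are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class IndividualWhitelist:
    threshold: int = 3
    login_counts: dict = field(default_factory=dict)
    trusted: set = field(default_factory=set)

    def check(self, lui):
        """Return 'ok' for a whitelisted LUI, otherwise 'warn'."""
        return "ok" if lui in self.trusted else "warn"

    def record_login(self, lui):
        """Count a login; whitelist the LUI once the count exceeds the threshold."""
        self.login_counts[lui] = self.login_counts.get(lui, 0) + 1
        if self.login_counts[lui] > self.threshold:
            self.trusted.add(lui)

# Hypothetical LUI described by its URL and resolved IP address.
whitelist = IndividualWhitelist(threshold=2)
lui = ("https://www.examplebank.example/login", "198.51.100.20")
before = whitelist.check(lui)
for _ in range(3):
    whitelist.record_login(lui)
after = whitelist.check(lui)
print(before, after)
```

Because the IP address is part of the stored LUI, a familiar URL that suddenly resolves to a different address is treated as unknown and triggers a warning again.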

II. Phishing detection toolbars and plug-ins Phishing heuristics are characteristics that are found in phishing attacks; however, these characteristics are not guaranteed to exist in every case. If a set of general heuristics can be identified, it is possible to detect zero-hour phishing attacks. Some of the methods that use these heuristics are: SpoofGuard, PhishGuard, Phishwish, CANTINA, etc.

(1) PhishGuard: A browser plug-in PhishGuard [81] is a browser plug-in that detects non-tunneled phishing attacks, i.e., attacks that do not use the legitimate site to tunnel output to the victim; detection is achieved by testing HTTP Digest authentication. The plug-in could incorporate other authentication mechanisms too. PhishGuard is triggered when it detects the start of an authentication process in which a user submits a user ID and a password (or some equivalent data). PhishGuard forwards the real user ID to the page, but with a (random) incorrect password instead of the real one, repeating this a certain number of times. If the page replies negatively, i.e., the HTTP response code is 401, there is a good chance it is a legitimate site. On the other hand, if the page replies positively, i.e., the HTTP response code is 200, the site can be considered a phishing site. The final result is reached on the basis of password hashes maintained by the tool, which declares the site a phishing site if it already has the hash of the entered password; otherwise the user is prompted to reenter the password.
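The bogus-credential probe at the core of this scheme can be sketched as follows. Network access is abstracted into a hypothetical `submit(user_id, password)` callable that returns the HTTP status code; the function name, the number of attempts and the "inconclusive" outcome are assumptions, and the final password-hash check of PhishGuard is not modeled.

```python
import secrets

def probe_login_page(submit, user_id, attempts=3):
    """Submit the real user ID with random wrong passwords and interpret the replies."""
    statuses = [submit(user_id, secrets.token_urlsafe(12)) for _ in range(attempts)]
    if any(status == 200 for status in statuses):
        return "phishing"            # a wrong password was accepted
    if all(status == 401 for status in statuses):
        return "likely legitimate"   # every wrong password was rejected
    return "inconclusive"

# Stand-ins for the two kinds of sites.
def accepts_anything(user_id, password):
    return 200

def rejects_wrong_passwords(user_id, password):
    return 401

print(probe_login_page(accepts_anything, "alice"))
print(probe_login_page(rejects_wrong_passwords, "alice"))
```

A phishing page usually cannot verify credentials and therefore accepts whatever is submitted, which is what the probe exploits.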

(2) Phishwish Phishwish [82] is a mechanism consisting of 11 rules to detect phishing messages or emails. The idea is to provide better protection against zero-hour attacks than blacklists, with minimal false positives. It requires fewer resources (11 rules) than SpamAssassin, which uses 795 rules. Phishwish analyzes the email header and the URLs contained in the email’s body.

The message is flagged as phishing in the following cases (rules):

  1. If a URL present in the message directs the user to a login page that is not the authentic one (the organization’s original login page), as checked with the help of a search.

  2. If a URL appears to use Transport Layer Security (TLS) in an HTML-formatted email, but the actual HREF attribute does not.

  3. If the URL has an IP address instead of a domain name for the host.

  4. If the name of the organization (e.g., eBay) is given in the URL, but not in the domain name.

  5. If the domain name in the URL does not match the domain name in the HREF attribute.

  6. If the organization’s domain name is not present in the received SMTP header.

  7. If there are inconsistencies in the domain portion of a URL.

  8. If there are inconsistencies in the domain part of an image link.

  9. If there are inconsistencies in the WHOIS records of the domain part of a non-image URL.

  10. If there are inconsistencies in the WHOIS information of the domain portion of an image link.

  11. If the page cannot be accessed.

The email score is calculated as the weighted mean of these rules. If the score of a given email is greater than 50 %, it is classified as phishing; otherwise it is classified as legitimate.
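The scoring step can be sketched as a weighted mean over the 11 rule outcomes, compared against the 50 % threshold. The rule weights used by Phishwish are not given here, so equal weights are assumed by default; the function names and the sample outcomes are illustrative.

```python
def phishwish_score(rule_hits, weights=None):
    """Weighted mean of rule outcomes (1 = rule fired, 0 = rule did not fire)."""
    if weights is None:
        weights = [1.0] * len(rule_hits)  # assumption: equal weights
    if len(weights) != len(rule_hits):
        raise ValueError("one weight per rule is required")
    return sum(w * h for w, h in zip(weights, rule_hits)) / sum(weights)

def classify(rule_hits, weights=None, threshold=0.5):
    """Label the email as phishing when the score exceeds the threshold."""
    return "phishing" if phishwish_score(rule_hits, weights) > threshold else "legitimate"

# Hypothetical outcomes of the 11 rules for two emails.
six_rules_fired = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
two_rules_fired = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(classify(six_rules_fired), classify(two_rules_fired))
```

With equal weights, an email is flagged once at least 6 of the 11 rules fire, since 6/11 is the first ratio above 50 %.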

(3) CANTINA CANTINA [83] is a content-based approach that decides whether a visited page is legitimate or phishing. CANTINA combines term frequency–inverse document frequency (TF-IDF), rule-based heuristics and search engine output to reduce false positives. When using TF-IDF along with some simple heuristics, CANTINA is able to detect about 90 % of phishing websites with a 1 % false positive rate. However, it suffers from performance issues due to the delay in querying the search engine.

(4) Other browser-based toolbars SpoofGuard [84] is a browser plug-in developed at Stanford University. It detects HTTP(S)-based phishing attacks by scoring certain irregularities found in the HTML content against a previously defined threshold. Other browser-based toolbars used at the client side are NetCraft [85], CloudMark [86], IE Phishing Filter [87] and the eBay toolbar [88]. These toolbars are developed and trained using the URLs of phishing web pages, and they warn the user when a suspicious page is encountered.
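The TF-IDF step can be sketched as below, selecting the highest-weighted terms of a page relative to a background corpus. A smoothed IDF variant is assumed, and the corpus, the page tokens and the function names are illustrative; how CANTINA combines the selected terms with search engine results is not modeled.

```python
from collections import Counter
from math import log

def tf_idf(page_tokens, corpus):
    """Score each term of the page; corpus is a list of token sets."""
    counts = Counter(page_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc)   # document frequency
        idf = log((1 + n_docs) / (1 + df)) + 1         # assumption: smoothed IDF
        scores[term] = (count / len(page_tokens)) * idf
    return scores

def top_terms(page_tokens, corpus, k=5):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = tf_idf(page_tokens, corpus)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy background corpus and page content.
corpus = [{"bank", "login", "the"}, {"the", "weather"}, {"the", "news"}]
page = ["examplebank", "login", "the", "examplebank", "login"]
print(top_terms(page, corpus, k=2))  # ['examplebank', 'login']
```

Terms that are frequent on the page but rare elsewhere obtain the highest weights, which makes them suitable as a compact description of the page.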

7.2.2.4 Server-side filters and classifiers

These are based on content filtering approaches and are appropriate to fight zero-day attacks. These filters are based on machine learning techniques and are categorized as:

(I) Phishing detection based on machine learning In this method, the input data are treated as an unordered set of words and are processed by machine learning classifiers and techniques such as naive Bayes, support vector machines (SVM), k-nearest neighbors, boosting and TF-IDF. SVM is the most popular of these. Chandrashekharan et al. [89] used a one-class SVM, training on email samples in a predefined plane and using features to map the samples into a new transformed space. The two classes of emails, e.g., ham and phishing, are separated by a hyperplane. TF-IDF [90] uses the document frequency of a word, i.e., the number of documents in which the word occurs. The k-nearest neighbors algorithm uses a similarity function on the training dataset; mails are then assigned to one of the previously created clusters [91]. Naïve Bayes is also commonly used for text classification [92]. It uses Bayes’ theorem for classification, and the features should be statistically independent to get good results.

(II) Phishing detection by data mining The techniques in this category treat phishing detection as a classification or clustering problem, to which machine learning algorithms such as k-means clustering or SVM are applied.

Gang Liu et al. [93] proposed an approach that detects a phishing website by finding websites similar to it and comparing it with them. If websites are found that are similar to the suspicious site but have a different domain name, the suspicious website is considered to be phishing. They extracted features from the URL and keywords, e.g., similarity in the text, layout and number of links in the web page. The DBSCAN algorithm is used to detect similarity by comparing the suspected website with all the websites, and the website with the highest similarity is taken to be the target website. To evaluate their approach, the authors used 8745 websites from PhishTank and 1000 legitimate websites collected from the Random Yahoo Link service.

Bazarganigilani [94] used an ontology concept for classifying the text of phishing emails, considering every word to be an attribute and its number of occurrences to be the value (which is referred to as the ontology concept). Term frequency variance, information gain and an adaptive Naïve Bayes technique are used, giving 94.87 % accuracy.

Chandrashekharan et al. [95] used structural features for detection of phishing emails. From each email, features such as the ratio of the total number of words to the total number of characters and the frequency of certain words are extracted. This approach used simulated annealing for feature selection and SVM as the classifier, with 95 % accuracy. However, the approach was tested only on a small dataset.

Kim et al. [96] devised an algorithm that detects DNS-poisoning-based attacks on the basis of network-level features. The data they used comprised 10,000 routing information instances destined for phishing and legitimate servers. Each instance consists of the mean round-trip time, the hop count between the user and the service accessed, and whether the service was behind a firewall. Their study showed that only 19 % of phishing websites are behind firewalls, while 79 % of legitimate websites are. Thus, phishing websites are hosted on less secure hosts than the legitimate target sites. The routing information was processed by algorithms such as SVM and k-NN.

PHONEY [97] detects links and HTML forms in an incoming email. Control then transfers to the content scanner, which analyzes the web page and extracts data from it. The data thus obtained are compared with the database entries, e.g., usernames and passwords. As the technique was tested on a very small amount of data, it cannot be determined whether it can address real-time phishing scams. Haijun Zhang et al. [98] proposed techniques that use content information to assign a class label to a suspected website. They proposed a number of techniques, of which the most effective used a naïve Bayesian classifier and image processing to compare both the textual and the visual characteristics of the sites. The naïve Bayes classifier gives a normalized number that specifies the similarity between the text of the suspected and legitimate websites. The image processing technique measures the similarity between the appearances of the two websites. The outputs of both approaches are then normalized over a selected interval, and the class with the highest probability of belonging to that interval decides the label of the suspected website.
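A minimal bag-of-words naïve Bayes classifier illustrates the independence assumption mentioned above: the class score is the log prior plus the sum of per-word log likelihoods. Laplace smoothing is assumed, and the class name and the tiny training set are illustrative; this is not the setup evaluated in [92].

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayesText:
    """Multinomial naive Bayes over token lists with Laplace smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = log(n_docs / total_docs)  # log prior
            for t in tokens:                  # words treated as independent
                score += log((self.word_counts[label][t] + 1)
                             / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set of tokenized emails.
train_docs = [
    ["verify", "account", "password"],
    ["urgent", "verify", "login"],
    ["meeting", "agenda", "monday"],
    ["lunch", "monday"],
]
train_labels = ["phishing", "phishing", "ham", "ham"]
model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict(["verify", "password"]), model.predict(["monday", "agenda"]))
```

Smoothing keeps a single unseen word from driving the probability of a class to zero.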

Ma et al. [99] proposed a model to detect phishing emails using hybrid features. It has five stages:

  • A feature generator extracts seven features from the email.

  • For feature selection, an adaptive machine learning algorithm is used, which is a combination of five algorithms.

  • The information gain is calculated.

  • A small vector of features is formed for evaluation.

  • In the last stage, a matrix of features is created for optimization of the features. The decision tree algorithm gives the best results for the short feature vectors.

PILFER [23] is a technique used to detect phishing emails. Emails are represented using 10 features, and the output of a spam filter is also used as a feature. For classification, a random forest with tenfold cross-validation is used. SVM is used for training and testing on the dataset. Since phishing sites are short-lived, many of the features cannot be extracted from old emails, but PILFER can still classify emails with 99.5 % accuracy. It is more accurate than a spam filter alone, with a false positive rate of approximately 0.0013 and a false negative rate of approximately 0.035, while without the spam filter output PILFER has accuracy comparable to that of the spam filter.

An approach that clusters phishing emails automatically [100] uses features such as document size, message content and HTML features. Adaptive k-means clustering is applied to these features. An objective function is produced whose final value is determined by the optimal clustering.

Beuskova et al. [101] proposed an approach that combines both supervised and unsupervised learning classification approaches. In order to randomize the input data, independent unsupervised clustering is used. The next step is to build a consensus clustering which is a combination of trained clusters on which supervised clustering is applied for the classification of the already clustered data.

In [102], a country-based model for phishing detection has been proposed; the objective was to develop an anti-phishing framework in accordance with a particular country’s Internet infrastructure (Saudi Arabia in this case). The authors exposed a prototype of their model to potential victims within the country, and for deployment their aim was to detect phishing web pages rather than block them.

In [103], only two features were used to determine the identity of web pages, and no other services are needed. The authors proposed a rule-based model called PhishDetector, which extracts hidden knowledge and is also able to detect zero-day phishing attacks.

Some other proposed techniques made use of multidimensional feature vectors and used information gain for feature selection, and any classification model can be used. Other techniques are Bayesian anti-phishing toolbar or the use of natural language processing for intrusion detection.

(III) Phishing detection based on soft computing techniques In this approach, knowledge discovery is used to simplify the evolution process, which can be a group of networks that execute continuously, changing their architecture and functions in parallel while remaining consistent with their environment and related systems. No dimensions are fixed, and the system grows in an open space, learning continuously both as an individual and as part of a larger system [104]. An evolving clustering method for classification is used in [105] to develop a phishing detection model that performs classification using a set of features. Other approaches by Almomani et al. [106, 107] are based on fuzzy neural networks to classify phishing and legitimate emails.

(IV) Multilayered phishing detection system This approach uses different classifier algorithms to improve the results. Some of them are:

An approach by Castillo et al. [108], referred to as FRALEC, classifies an email into ham or phishing classes. It makes use of three filters: (1) a Naïve Bayes classifier that scrutinizes the text content of the email; (2) a rule-based classifier that uses non-grammatical features to classify the email as fake, legitimate or suspicious; and (3) an emulator-based filter that re-classifies the suspicious emails. This technique gives 99.8 % accuracy.

Multitier classification [109] makes use of three classifiers in a layered fashion which extracts and classifies features in a sequential manner, and the output thus obtained is sent to the decision classifier. In case of any misclassification by the upper two layers, the last classifier will make the final decision. The approach gave best results of 97 % accuracy when the sequence of classifiers used from top to bottom was: SVM, AdaBoost and Naïve Bayes, respectively.

Profiling of phishing emails [110] is done by obtaining structural features, extracting the links embedded in the emails and then representing the emails in terms of features such as WHOIS information. To get multilabel class predictions, SVM followed by a boosting algorithm is used (Table 9).

Table 9 Phishing detection techniques

(V) Phishing detection by visual similarity This method identifies phishing web pages by checking their resemblance to legitimate web pages [106, 111]. Since most phishing websites look almost the same as their target websites, these techniques make use of the rendered view of the web page rather than the code behind it. One such approach, proposed by Fu et al. [112], used Internet Explorer to capture snapshots of websites, which are then converted to 100 × 100 images; a feature vector is formed from each image, and the comparison result is normalized to a number from 0 to 1. When two images are compared, a normalized value of 0 means that the images are different, and a value of 1 means that they are the same. If the value lies between 0 and 1, threshold values are used to categorize the web page.

The authors in [113] proposed a visual similarity-based strategy to identify phishing websites. The first step checks emails at the mail server for suspicious words, phrases and URLs. The second step monitors the suspicious web pages by comparing them to legitimate pages and measuring their similarity with respect to layout, page style, etc. Medvet et al. [114] compared suspected web pages to legitimate pages using three features, which can be text or style related, that make the pages appear similar. For evaluation, they used real phishing web pages along with their targets. Hara et al. [115] proposed a technique based on visual similarity. A collection of legitimate websites is used to train the classifier and is stored in a database. Whenever a suspected website is found, its snapshot is compared to the websites in the database, and threshold values on the similarity between the websites are used to assign a label.
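The normalization and thresholding idea can be sketched with a deliberately simple stand-in metric: the mean absolute pixel difference of two equally sized grayscale images, mapped to [0, 1]. This is not the comparison method of [112]; the metric, the threshold value and the function names are assumptions chosen only to show how a normalized score is turned into a decision.

```python
def similarity(img_a, img_b):
    """Similarity in [0, 1] of two equally sized grayscale images (values 0..255)."""
    total, count = 0, 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return 1.0 - total / (255 * count)

def categorize(score, threshold=0.9):
    """Flag pages whose appearance is close to that of a protected page."""
    return "visually similar" if score >= threshold else "different"

# Synthetic 100 x 100 snapshots: black, white, and black with one bright row.
size = 100
black = [[0] * size for _ in range(size)]
white = [[255] * size for _ in range(size)]
near_black = [row[:] for row in black]
near_black[0] = [255] * size

print(similarity(black, black), similarity(black, white))
print(categorize(similarity(black, near_black)))
```

A page that looks like a protected site but is served from a different domain would then be treated as a phishing candidate.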

8 Open issues and challenges

Various solutions to control phishing attacks have been given in the literature. However, no solution is a “silver bullet” against phishing. The phishing threat is increasing with time and is becoming a common form of e-crime. Whenever researchers come up with an idea to control this problem, phishers change their attack strategy by exploiting vulnerabilities found in the current solution. It is therefore a very tight race between phishers and researchers. Phishing frauds can be committed either by social engineering or by using malicious code. In a social engineering scheme, the phisher uses either spoofed emails or fake websites to fool users and commit fraud. Therefore, solutions are also based on these observations.

The blacklisting and whitelisting approaches have low FP rates but are ineffective against zero-hour phishing attacks, i.e., these approaches are able to detect only about 20 % of such attacks. They also require communication over the network, which lowers performance. PhishNet [79] requires high bandwidth in order to extend the blacklist. The Google safe browsing API [77] aims to lower the bandwidth requirements. In the case of AIWL [80], the efficiency depends entirely on how the user trains his/her browser. The machine learning and data mining approaches give the best results in phishing detection. Chandrashekharan et al. [85] used structural features with SVM to detect phishing attacks with 95 % accuracy. However, this approach is very time-consuming, even for a small dataset. The accuracy of the system using SVM can be increased up to 97 %. PILFER [104] also gives about 95 % accuracy, but its FP and FN rates show that a considerable number of emails are not classified correctly. Similarly, the robust classifier model [105] is 99.8 % accurate. However, it is time-consuming because of its five stages, and the datasets used are not standard.
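The point that a high accuracy can coexist with poor FP or FN rates is easier to see with a small worked example. The sketch below uses the standard confusion-matrix definitions; the counts are hypothetical and are not taken from any of the cited systems.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, FP rate and FN rate from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one phishing and one legitimate sample")
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fp_rate": fp / (fp + tn),  # legitimate emails flagged as phishing
        "fn_rate": fn / (fn + tp),  # phishing emails that were missed
    }


# Hypothetical test set: 1,000 emails, of which 100 are phishing.
m = detection_metrics(tp=70, fp=20, tn=880, fn=30)
```

For these counts the accuracy is 0.95, yet the FN rate is 0.30, i.e., 30 of the 100 phishing emails go undetected. Because legitimate emails dominate the set, accuracy alone hides this.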

Phishing detection by heuristics also gave good results, but some of the techniques have very high FP rates, e.g., SpoofGuard [81] and PhishWish [83]. Because PhishWish relies on 11 fixed rules, it does not adapt to changes in the scenario. CANTINA [84] also has a high FP rate in addition to its time-consuming processing. Another challenge with these approaches is the need for frequent updates, which makes them quite expensive. User awareness is an important issue in the defense against phishing attacks. Along with increased user education, other remedies could be enhancements to user interfaces, i.e., giving active warnings and automatically detecting malicious messages.

Recently, one of the newest areas, i.e., IoT, has also become a victim of phishing attacks. IoT is a fast-evolving architecture that connects everyday objects and makes our lives more comfortable. However, because of the limited resources available to IoT devices, their security mechanisms are not very strong, which makes them a very easy target for attackers [116–118]. In January 2014, Proofpoint reported the first spam and phishing attacks involving IoT devices such as refrigerators and smart TVs; the attackers used these devices as a medium to send about 100,000 emails containing malware. Once infected, IoT devices have to be brought offline to remove the malware, and those that were not remain infected. In the year 2013, 20 billion devices were connected to the Internet, and this number is expected to increase to 32 billion by the year 2020. Smart things are the future, and everyone appreciates them, but these devices are also making the job of attackers easy [119–121].

9 Conclusion

It has been almost 20 years since the phishing problem was identified, yet it is still used to steal personal information, online credentials and credit card details. Various solutions are available, but whenever a solution is proposed to overcome these attacks, phishers exploit the vulnerabilities of that solution to continue the attack. Phishing attacks can be classified broadly into two categories: social engineering, which refers to acquiring a user’s credentials using emails or fake websites, and malware attacks, which use malicious code or software to acquire the required data. Several approaches to defend the user against email and website phishing were discussed in this survey.

Our survey helps new researchers to understand the history of phishing, the current trends in attacks and the shortcomings of the various available solutions. Defense against phishing attacks is one of the hardest challenges faced in network security these days. A good defense mechanism should be able to detect phishing attacks with low false positives. The defense techniques discussed in this survey are blacklisting, data mining and heuristics, machine learning and soft computing algorithms. Blacklisting techniques have minimal FP rates but consume a lot of bandwidth and should be avoided if there is a possibility of zero-hour attacks. The heuristic and data mining techniques have higher FP rates than blacklists and high computational costs, but they are better at detecting zero-hour attacks. The machine learning techniques give the best results because they are able to mitigate zero-hour phishing attacks better than the other techniques. Some of the machine learning techniques [104, 105] achieve TP rates of up to 99 %.

Lack of awareness among users is also a factor in the success of phishing attacks. Thus, educating users is also required to reduce phishing attacks; in addition, improved interfaces that give warnings, or the automatic removal of malicious content before it reaches end-users, would be a more promising approach. After the classification, we also described various issues and challenges in current solutions to give new researchers ideas for future study on defending against phishing attacks.