1 Introduction

An optical illusion is a visual phenomenon in which what we see differs from the physical reality. It is not a simple mistake, because the same phenomenon occurs even when we know the real shape. In this sense, an optical illusion stems from the intrinsic nature of the human visual system, and it cannot be removed by training a neural network [3, 13].

The purpose of seeing is to understand the surrounding environment in order to perform actions such as grasping objects and avoiding obstacles. Therefore, optical illusions may cause accidents and other inconveniences [4]. For example, the slope illusion causes us to misinterpret downhill roads as uphill roads [1, 8, 18], and the Tarski illusion causes us to perceive road curves as gentler than they are in reality [16]. These misperceptions might cause traffic accidents. Hence, it is practically important to understand the mechanisms of optical illusions and to develop methods that remove or reduce illusions as much as possible, for safety and reliability in our daily lives.

An important class of optical illusions occurs when we view photographic images. Illusions in this class include the leaning-tower illusion [7], the stretching-tub illusion [7], and the corridor illusion, to mention a few. In this paper, we concentrate on a class of photographic images that cause misunderstandings in human communication.

Photographic images are convenient tools for visual communication. They let us see what we want to see even when it is too far away to view directly. However, images are not necessarily faithful to the actual scene, and they sometimes give us a wrong impression. When we visit a hotel-reservation web site, for example, we often come across an image of a room that makes the room look bigger than its actual size. Similar room-size exaggerations appear in images for various advertisements, such as those for real estate properties, rental office spaces, and event halls.

It is widely known among photographers that room-size exaggeration occurs when images are taken with wide-angle lenses [15]. Photographers want to convey as much visual information about a room as possible in a single image. A standard lens covers only a small portion of the room, so they use wide-angle lenses instead, which cover a wider range. However, the resulting images exaggerate the depth of the room.

Such images might be regarded as false advertising, but this is often unintentional. The photographers simply want to show more of the room in the advertisement, while also depicting information about the room as accurately as possible. Based on this understanding, we seek a method to show more of a room without size exaggeration.

From a physics point of view, taking an image is a purely optical process, and hence the precise appearance of a scene is recorded in the image. Indeed, if we view the image from the viewpoint corresponding to the lens center, we see the true appearance of the scene. Room-size exaggeration occurs simply because we view the image from an inappropriate viewpoint. In this sense, room-size exaggeration is a type of optical illusion. On the basis of this recognition, we propose a method to recover the true appearance of the scene from images with an exaggerated room size. This paper is an extended study of the author’s previously published short abstract on this topic [17].

The rest of the paper is organized as follows. The source of the optical illusion in wide-angle images is clarified in Sect. 2, and a method for removing the illusion and recovering the true appearance of the scene is constructed in Sect. 3. The proposed method is validated using computational experiments in Sect. 4, and supporting techniques and possible applications are discussed in Sect. 5. Concluding remarks are given in Sect. 6.

2 Room-size illusion

We consider why the room size is exaggerated in an image taken by a wide-angle lens [15]. As shown in Fig. 1, let us consider a room with floor F, wall W, and ceiling C, and take its image using a camera with the lens center at P and the image plane at S. The projected images of the corner vertices are shown by small dots on S. If we place our viewpoint at P, the image has the same appearance as the original room. In this sense the projected image conveys faithful information about the room.

However, it is difficult to see the image from P, because P is too close to the image plane S. A wide-angle lens has a short focal length, which means that the distance from P to S is short. Therefore, we usually see the image from a viewpoint that is much farther from S than P. Let this viewpoint be at P\('\). Because light travels in straight lines, the brain extends each vertex of the image along the half line starting at P\('\) and passing through that vertex (broken lines in Fig. 1) until the object reaches its known size, and thus reconstructs a 3D structure. As shown in Fig. 1, the reconstructed floor, wall, and ceiling are F\('\), W\('\), and C\('\), respectively. The room is reconstructed in such a way that the distance between F\('\) and C\('\) is approximately the known ceiling height of an ordinary room. As a result, we perceive the room as farther away than it actually is, and its depth appears much larger than the actual depth. This is the geometric mechanism by which the room-size illusion occurs.

Fig. 1
figure 1

Reason for the depth exaggeration

An example of this visual effect is shown in Fig. 2 using a paper model of a cubic room. A general view is shown in Fig. 2a. The front wall is removed to take images of the inside, and the ceiling is removed to make the inside bright. Figure 2b, c shows two images of this room taken from the front with different lenses. Panel (b) was taken with a wide-angle lens with a 14-mm focal length for a 35-mm full-frame camera, and panel (c) with a standard lens with a 70-mm focal length for the same camera. When we view these two images from the same distance, we perceive the depth of the room to be much larger in (b) than in (c). For Fig. 2b, the distance from the image to the lens center is equal to 0.39 times the width of the image; for Fig. 2c, it is equal to 1.9 times the width of the image. If we view each image from its associated lens center, we obtain the same visual angle between any two points in the scene as if we saw the room directly from the camera position, and consequently we can expect to perceive the depth of the room correctly.
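The distance-to-width ratios quoted above follow from the sensor geometry alone. A minimal sketch (the function name is ours, for illustration):

```python
# A 35-mm full-frame sensor is 36 mm wide.  The image and the distance
# to the lens center scale together, so the ratio of the lens-center
# distance to the image width is simply focal length / sensor width.
def distance_to_width_ratio(focal_length_mm, sensor_width_mm=36.0):
    return focal_length_mm / sensor_width_mm

print(round(distance_to_width_ratio(14), 2))  # 0.39, as stated for Fig. 2b
print(round(distance_to_width_ratio(70), 2))  # 1.94, i.e. about 1.9 for Fig. 2c
```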

Fig. 2
figure 2

Cubic room and images of it taken with different focal-length lenses: a general view, b 14-mm focal-length image, c 70-mm focal-length image

As we see in Fig. 1, P is the lens center at which the image on S is taken. Geometrically, the image is the central projection of the scene with the center of projection at P and the plane of projection at S. The stored image may be displayed on a computer screen or a sheet of paper at various sizes. We assume that the image is expanded isotropically in every direction by a common scale factor, say s; that is, if we expand the image by s in the horizontal direction, we also expand it by s in the vertical direction. At the same time, we expand the surrounding 3D space by the same factor so that the distance between P and S is also multiplied by s. We call the resulting position of P the associated lens center with respect to the expanded image.

Suppose that two points \(\textrm{Q}_1\) and \(\textrm{Q}_2\) are seen from viewpoint P. The angle formed by two half lines starting at P and passing through either \(\textrm{Q}_1\) or \(\textrm{Q}_2\) is called the visual angle of \(\textrm{Q}_1\) and \(\textrm{Q}_2\) formed at P. The visual angle between any two points in the image formed at the associated lens center is invariant under scaling and is the same as the visual angle we obtain when we see the original scene directly from the camera position.
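The scale invariance of the visual angle is easy to check numerically. The following sketch (function name ours) computes the visual angle of two points formed at a viewpoint:

```python
import math

# Visual angle of two points q1, q2 formed at viewpoint p, computed
# from the dot product of the two lines of sight.
def visual_angle(p, q1, q2):
    v1 = [a - b for a, b in zip(q1, p)]
    v2 = [a - b for a, b in zip(q2, p)]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(a * a for a in v2))
    return math.acos(dot / (n1 * n2))

# Expanding the whole configuration by a common factor s (viewpoint at
# the origin) leaves the visual angle unchanged.
p, q1, q2 = (0.0, 0.0, 0.0), (1.0, 0.0, 2.0), (-1.0, 1.0, 2.0)
s = 3.0
assert abs(visual_angle(p, q1, q2)
           - visual_angle(p, [s * c for c in q1], [s * c for c in q2])) < 1e-12
```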

We can state our observation in the form of the following principle:

True-Appearance Principle. An image gives the same visual angle as the original scene if and only if it is seen from the associated lens center.

On the basis of this principle, we construct a method for recovering the true appearance of the scene from a given image.

3 Recovery of true appearance

As shown in Fig. 3, let A\('\) be the image taken by a camera with the lens center at P. Because we use a wide-angle lens, the distance d from P to A\('\) is relatively short.

Fig. 3
figure 3

Transformation from a wide-angle image to standard-lens images

Suppose that we want to take another image of the same scene at the same camera position using a standard lens. The standard lens has a focal length between 50 mm and 70 mm for a 35-mm full-frame camera, which is much larger than that of the wide-angle lens. So, as shown in Fig. 3, the resulting image A will be much smaller than A\('\) if we adjust the scale in such a way that the distance between the image plane and the associated lens center is equal to d. Thus, cutting off the surrounding area in image A\('\) gives us a standard-lens image in the same view direction.
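The size of this central cut-off region is easy to quantify. Under the scaling just described, a lens of focal length f (with a 36-mm-wide full-frame sensor) covers a width of 2·d·(18/f) on the plane at distance d, so the standard-lens crop occupies the fraction f_wide/f_standard of the wide-angle image width. A sketch (function name ours):

```python
# Width fraction of the wide-angle image A' occupied by the
# standard-lens crop A, with both images scaled so that the associated
# lens center is at distance d.  The factor 2 * d * 18 cancels,
# leaving f_wide / f_standard.
def crop_fraction(f_wide_mm, f_standard_mm):
    return f_wide_mm / f_standard_mm

print(crop_fraction(14, 70))  # 0.2: the crop is one fifth of the image width
```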

Because the wide-angle image also conveys visual information outside A, we may recover the standard-lens images that would be obtained if we changed the viewing direction. We next consider a method for achieving this.

We fix an \((x,y,z)\) Cartesian coordinate system, place the lens center P at the origin, and place the original wide-angle image A\('\) on the plane \(z=d\). Because the image A\('\) is parallel to the \(xy\) plane, we use the symbol \((x,y)\) to represent the 2D coordinate system on A\('\), as shown in Fig. 3.

Suppose that we change the view direction by rotating it around the \(y\) axis by angle \(\alpha \), and let B be the image frame corresponding to the standard lens. Let \((X',Y',Z')\) be the coordinate system obtained from \((x,y,z)\) by this rotation. Then, we have

$$\begin{aligned} X'&= x \cos \alpha + z \sin \alpha , \end{aligned}$$
(1)
$$\begin{aligned} Y'&= y, \end{aligned}$$
(2)
$$\begin{aligned} Z'&= -x \sin \alpha + z \cos \alpha . \end{aligned}$$
(3)

B is obtained from A by the same rotation. Hence, the plane containing B is represented by

$$\begin{aligned} z = {d \over {\cos \alpha }} + x \tan \alpha . \end{aligned}$$
(4)

Let Q = \((x,y,d)\) be an arbitrary point in image A\('\). The line of sight from P to Q is represented by the vector \(l(x,y,d)\), where \(l\) is a parameter. The value of \(l\) corresponding to the point of intersection of this line with the plane is obtained by substituting \(l(x,y,d)\) into Eq. (4). Thus, we get

$$\begin{aligned} ld={d \over {\cos \alpha }} + lx \tan \alpha , \end{aligned}$$
(5)

and hence

$$\begin{aligned} l= {d \over {(d-x \tan \alpha ) \cos \alpha }}. \end{aligned}$$
(6)

Thus, the point of intersection is represented by

$$\begin{aligned} {d \over {(d-x \tan \alpha ) \cos \alpha }} (x,y,d). \end{aligned}$$
(7)

This point is represented by the coordinate system \((X', Y', Z')\) as

$$\begin{aligned} X'&= {d \over {(d-x \tan \alpha ) \cos \alpha }} (x\cos \alpha + d \sin \alpha ), \end{aligned}$$
(8)
$$\begin{aligned} Y'&= {dy \over {(d-x \tan \alpha ) \cos \alpha }}, \end{aligned}$$
(9)
$$\begin{aligned} Z'&= {d \over {(d -x \tan \alpha ) \cos \alpha }} (-x \sin \alpha + d \cos \alpha ) = d. \end{aligned}$$
(10)

Next, suppose that we change the view direction by rotating around the \(X'\) axis by angle \(\beta \), and call the resulting image frame C. Let the resulting coordinate system be \((X,Y,Z)\). Then, we obtain

$$\begin{aligned} X&=X', \end{aligned}$$
(11)
$$\begin{aligned} Y&= Y' \cos \beta + Z' \sin \beta , \end{aligned}$$
(12)
$$\begin{aligned} Z&= -Y' \sin \beta + Z' \cos \beta . \end{aligned}$$
(13)

The plane containing C is represented by

$$\begin{aligned} Z' = {d \over {\cos \beta }} + Y' \tan \beta . \end{aligned}$$
(14)

The line of sight from P to \((X', Y', d)\) is represented by \(l' (X',Y',d)\), and the point of intersection of this line with the plane (14) satisfies

$$\begin{aligned} l' = { d \over {(d-Y' \tan \beta )\cos \beta }}. \end{aligned}$$
(15)

Consequently, the point of intersection is represented with respect to the \((X,Y,Z)\) coordinate system as

$$\begin{aligned} X&={ dX' \over {(d -Y' \tan \beta ) \cos \beta }}, \end{aligned}$$
(16)
$$\begin{aligned} Y&={ d \over {(d -Y' \tan \beta ) \cos \beta }} (Y' \cos \beta + d \sin \beta ), \end{aligned}$$
(17)
$$\begin{aligned} Z&=d. \end{aligned}$$
(18)

Equations (8)–(10) and (16)–(18) together represent the relation between the point \((x,y,d)\) in the original image frame A\('\) and the corresponding point \((X,Y,d)\) in the rotated frame C.

To summarize the above arguments, we can construct the procedure to obtain the true appearance in the following way.

Suppose that we want to change the view direction by first rotating the viewpoint around the vertical axis by \(\alpha \) and then around the horizontal axis by \(\beta \), and then want to generate the image corresponding to frame C. For each point \((X,Y)\) in C, we compute the corresponding point \((X',Y')\) in B by the inverse transformation of Eqs. (16) and (17), that is,

$$\begin{aligned} X'&={ dX \over {(d +Y \tan \beta ) \cos \beta }}, \end{aligned}$$
(19)
$$\begin{aligned} Y'&={ d \over {(d +Y \tan \beta ) \cos \beta }} (Y \cos \beta - d \sin \beta ), \end{aligned}$$
(20)

and then compute the corresponding point \((x,y)\) in A\('\) by the inverse transformation of Eqs. (8) and (9), that is,

$$\begin{aligned} x&= {d \over {(d +X' \tan \alpha ) \cos \alpha }} (X'\cos \alpha - d \sin \alpha ), \end{aligned}$$
(21)
$$\begin{aligned} y&= {dY' \over {(d+ X' \tan \alpha ) \cos \alpha }}. \end{aligned}$$
(22)

We extract the color value at \((x,y)\) from A\('\) and embed it at the point \((X,Y)\) in C.

For actual computation, we cover frame C with a finite number of pixels. For each pixel \((X,Y)\), we compute the corresponding coordinates \((x,y)\) and determine the pixel's color by averaging the color values of the pixels of A\('\) near \((x,y)\).
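The coordinate mapping at the heart of this procedure can be sketched in Python. The inverse map follows Eqs. (19)–(22); the forward composition of the two rotations is written out again from the geometry only to verify that the two maps are mutual inverses. Function names are ours:

```python
import math

def inverse_map(X, Y, d, alpha, beta):
    """Map a point (X, Y) of the rotated frame C back to (x, y) in the
    original wide-angle image A', following Eqs. (19)-(22)."""
    # Undo the rotation around the horizontal axis by beta (Eqs. 19, 20).
    Xp = d * X / ((d + Y * math.tan(beta)) * math.cos(beta))
    Yp = d * (Y * math.cos(beta) - d * math.sin(beta)) / ((d + Y * math.tan(beta)) * math.cos(beta))
    # Undo the rotation around the vertical axis by alpha (Eqs. 21, 22).
    x = d * (Xp * math.cos(alpha) - d * math.sin(alpha)) / ((d + Xp * math.tan(alpha)) * math.cos(alpha))
    y = d * Yp / ((d + Xp * math.tan(alpha)) * math.cos(alpha))
    return x, y

def forward_map(x, y, d, alpha, beta):
    """Forward composition: rotate by alpha around the vertical axis,
    then by beta around the horizontal axis."""
    Xp = d * (x * math.cos(alpha) + d * math.sin(alpha)) / ((d - x * math.tan(alpha)) * math.cos(alpha))
    Yp = d * y / ((d - x * math.tan(alpha)) * math.cos(alpha))
    X = d * Xp / ((d - Yp * math.tan(beta)) * math.cos(beta))
    Y = d * (Yp * math.cos(beta) + d * math.sin(beta)) / ((d - Yp * math.tan(beta)) * math.cos(beta))
    return X, Y

# The two maps are mutual inverses.
d, alpha, beta = 1.0, math.radians(20), math.radians(-15)
x0, y0 = 0.3, -0.2
x1, y1 = inverse_map(*forward_map(x0, y0, d, alpha, beta), d, alpha, beta)
assert abs(x1 - x0) < 1e-12 and abs(y1 - y0) < 1e-12
```

In a full implementation, one would loop over the pixels of C, call `inverse_map` for each, and average the colors of the pixels of A\('\) near each computed \((x,y)\), as described above.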

4 Examples

This section presents the behavior of our method using wide-angle images. In those images, not only depths are exaggerated, but also the shapes of objects are distorted, especially near the margins of the image frame. We show that both the depth exaggeration and the shape distortion are removed by our method.

In this paper, focal lengths are always given with respect to a 35-mm full-frame camera; we hereafter omit the phrase “with respect to a 35-mm full-frame camera.”

The true appearance is recovered by changing the given image in such a way that the associated lens center coincides with the viewpoint from which we see the image. Therefore, strictly speaking, we need the location of the viewpoint. Instead of measuring the viewpoint, however, we assume that the image will be seen from the viewpoint corresponding to the lens center of the so-called “standard lens,” that is, a lens with a focal length between 50 mm and 70 mm. It is widely accepted that an image taken with a standard lens gives a nearly faithful appearance of the original scene. This is probably because the focal length of a standard lens is close to the distance at which we usually view the image, or at least the viewpoint is not far from the lens center, and hence our brain can correct the distortion. Thus, we convert the given wide-angle images to standard-lens images.

4.1 Example 1: lounge

Figure 4a shows an image of a lounge in the author’s institute, taken with a wide-angle lens with a 12-mm focal length. Panels (b), (c), and (d) are recovered images assuming a 70-mm focal length.

Fig. 4
figure 4

Image of an institutional lounge: a wide-angle-lens image, b recovered image in the direction of 20 deg to the right and 30 deg upward, c recovered image in the direction of 40 deg to the right and 15 deg downward, d recovered image in the direction of 40 deg to the left and 15 deg downward

Figure 4b shows the recovered image, which assumes that the view direction is rotated 20 deg to the right and 30 deg upward. From this image, we can understand that the roof ventilation hatch is almost a square. In the original image (a), however, this hatch appears as a rectangle whose length in the depth direction is much larger than its width. This is a distortion effect of the wide-angle lens that can be removed by our method.

Figure 4c shows the image obtained if we rotate the view direction 40 deg to the right and 15 deg downward. From this image, we understand that the table is square and the chairs have a normal size for one person.

Figure 4d shows the image obtained when we rotate the viewpoint 40 deg to the left and 15 deg downward. From this image, we understand that the round tables are circular instead of elliptical, and the chairs have a normal size instead of being long seats like beds.

All the distortions we can see in (a) elongate the objects in the depth direction. In this way, they create the impression that the depth is larger than the real depth.

4.2 Example 2: utility room

Figure 5a shows an image of a utility room next to a bath taken with a wide-angle lens with a 12-mm focal length. From this image, we get the impression that the center wall with the dark door meets the right wall at an obtuse angle much larger than 90 deg, and that the mirror on the left wall is wider than it is tall. From this image, we can recover images corresponding to those obtained with a 70-mm standard lens.

Fig. 5
figure 5

Image of a utility room: a wide-angle-lens image, b recovered image in the direction of 30 deg to the right and 20 deg downward, c recovered image in the direction of 15 deg to the left and 20 deg upward, d recovered image in the direction of 15 deg to the left and 10 deg upward, e recovered image in the direction of 15 deg to the left, f recovered image in the direction of 40 deg to the left and 10 deg downward

Figure 5b shows the scene obtained by 30-deg rotation to the right and 20-deg rotation downward. From this image we can understand that the center wall meets the right wall orthogonally.

Figure 5c–e provides the standard-lens images we obtain when we change the view direction 15 deg to the left and respectively 20, 10, and 0 deg upward. From these three images we understand that the width of the mirror is smaller than its height.

Figure 5f is a standard-lens image we get when we rotate our viewpoint 40 deg to the left and 10 deg downward. We can understand that the soap-bubble-toy holder is circular, although it seems elliptical in Fig. 5a.

The false impressions about the wall angle, the mirror shape, and the toy-holder shape together contribute to an exaggerated room depth.

4.3 Example 3: house

Figure 6a represents an image of a house taken with a 12-mm wide-angle lens. From this image, we get the impression that the left and right walls meet at an acute angle instead of a right angle, and that the left and right roads are almost parallel. Images with similar tendencies can be found in advertisements for real estate properties. However, these impressions are false. Figure 7 shows the plan view of the house and the roads, where the dot and the arrow represent the position and orientation, respectively, of the camera with which the image was taken. The walls meet at a right angle, and the left and right roads also meet at a right angle.

Fig. 6
figure 6

Image of a house: a wide-angle-lens image, b recovered image in the direction of 15 deg to the right and 20 deg upward, c recovered image in the direction of 40 deg to the left and 25 deg downward, d recovered image in the direction of 40 deg to the right and 25 deg downward

Fig. 7
figure 7

Plan view of the house and roads in Fig. 6

Figure 6b is a 70-mm standard-lens image obtained by rotating the view direction 15 deg to the right and 20 deg upward. From this image we understand that the left and right walls meet at a right angle.

Figure 6c is a standard-lens image obtained by rotating the viewpoint 40 deg to the left and 25 deg downward. Figure 6d is the image obtained by a rotation of 40 deg to the right and 25 deg downward. From these images, we understand that the front road meets the left and right roads at obtuse angles much larger than 90 deg, and that the left and right roads are not parallel. We can also generate a video of the standard-lens appearance of the scene by continuously changing the view direction. When we watch the continuous change from Fig. 6c to d, we understand that the roads meet at almost right angles.

5 Discussion

5.1 Lens-center information

Our method requires knowing the position of the lens center from which the image is taken. However, the lens-center information is usually not included when images are stored. The best way to retain that information seems to be to change our technology culture to record the lens-center information in the digital image file.

Depth-exaggerated images are widely used in advertisements, but the exaggeration is usually not intentional. Photographers just want to show a wide range of space in a single image, and they also want to show faithful appearances if possible. Because we are proposing a method for more faithful representation, we can expect that photographers will be willing to provide the lens-center information.

For example, a hotel-reservation web site company can record and attach the lens-center information to all images posted in the web site, and install our method so that users can see the true appearance of the room by changing the view direction. This is the ideal way to apply our method.

From the user’s point of view, however, we want to recover the true appearance even if the lens-center information is not given. In that case, we need to estimate the lens center from the image itself, which is often possible.

Ordinary rooms are rectangular, and hence contain three mutually orthogonal groups of parallel lines, formed by the intersections of the walls, floor, and ceiling. A group of parallel lines in 3D space, unless the lines are parallel to the image plane of the camera, forms a set of radial lines emanating from a common point in the image, called the vanishing point. Images of a rectangular room are classified into one-point, two-point, and three-point perspective projections according to how many groups of parallel lines form vanishing points. For a three-point perspective projection, the lens center can be determined as the intersection of the three spheres whose diameters are defined by pairs of vanishing points [2, 9, 15]. For a two-point perspective projection, the lens center can be determined as the intersection of the sphere whose diameter connects the two vanishing points and the line passing through the center of the image perpendicular to the image plane [15]. For a one-point perspective projection, the lens center cannot be determined from the vanishing point, and hence we need other information to estimate it.
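For the three-point case, the sphere conditions can be solved in closed form: differencing the three sphere equations shows that the foot of the lens center is the orthocenter of the vanishing-point triangle, and the height above the image plane follows from any one sphere. A sketch under the assumption that the image plane is z = 0 (function name and the synthetic check are ours):

```python
import math

def lens_center_from_vanishing_points(v1, v2, v3):
    """Lens center P from the three vanishing points of a three-point
    perspective image (image plane z = 0).  P satisfies
    (P - Vi) . (P - Vj) = 0 for every pair, i.e. it lies on the three
    spheres whose diameters join the vanishing points."""
    (x1, y1), (x2, y2), (x3, y3) = v1, v2, v3
    # Differencing the sphere equations gives two linear equations whose
    # solution (u, v) is the orthocenter of the vanishing-point triangle.
    a11, a12 = x3 - x2, y3 - y2
    a21, a22 = x3 - x1, y3 - y1
    b1 = x1 * a11 + y1 * a12
    b2 = x2 * a21 + y2 * a22
    det = a11 * a22 - a12 * a21
    u = (b1 * a22 - b2 * a12) / det
    v = (a11 * b2 - a21 * b1) / det
    # Height above the image plane from any one sphere equation.
    w2 = -((u - x1) * (u - x2) + (v - y1) * (v - y2))
    return u, v, math.sqrt(w2)

# Synthetic check: build vanishing points from a known camera whose
# lens center is P and whose three orthonormal room directions are the
# columns of a rotation matrix.
a, b = 0.3, 0.4
ca, sa, cb, sb = math.cos(a), math.sin(a), math.cos(b), math.sin(b)
dirs = [(ca, sb * sa, -cb * sa), (0.0, cb, sb), (sa, -sb * ca, cb * ca)]
P = (0.5, -0.3, 2.0)
vps = [(P[0] - P[2] / dz * dx, P[1] - P[2] / dz * dy) for dx, dy, dz in dirs]
u, v, w = lens_center_from_vanishing_points(*vps)
assert max(abs(u - P[0]), abs(v - P[1]), abs(w - P[2])) < 1e-9
```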

The lens center can also be estimated if we have knowledge about objects in the images [10]. Machine learning techniques can also be used to estimate the lens center [19].

It is interesting to note that the lens center can easily be determined from an image taken by a 360-deg camera, if the image is recorded as a cylindrical projection. For this kind of image, we make a cylinder by gluing together the left and right edges of the image. Then, the lens center is at the center of the cylinder. This is true no matter at what size the image is printed.
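Under this cylindrical-projection assumption, the viewing position follows from the print width alone, since the glued cylinder has circumference equal to that width (function name ours):

```python
import math

# For a 360-deg cylindrical image of printed width W, the cylinder has
# circumference W, so the lens center lies on the axis at radius
# r = W / (2 * pi) -- at any print size.
def viewing_radius(printed_width):
    return printed_width / (2 * math.pi)

print(round(viewing_radius(100.0), 2))  # a 100-cm-wide print: r is about 15.92 cm
```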

5.2 Guiding the viewpoint

As we have discussed, standard-lens images can be used to approximately match the lens center to the user’s viewpoint. We can present the true appearance more faithfully if the user’s viewpoint matches the associated lens center exactly. One method for that purpose is to provide eyeglass-like equipment, such as that shown in Fig. 8, to guide the eyes. However, such equipment might be undesirable because it inconveniences the user.

Fig. 8
figure 8

Viewpoint guide

A more flexible method might be to measure the user’s viewpoint online and adjust the size and the position of the image on the display screen dynamically in a way such that the lens center matches the user’s viewpoint. This method requires some equipment to detect the user’s eye position.

There is a trade-off between the use of equipment to measure the user’s eye position, and the use of fixed-focal-length images, such as standard-lens images. The former is more precise but constrains the users. The latter requires no action by the users but is less precise. It is one of our future research topics to investigate which option is more practical.

A head-mounted display, if it is available, can also solve the viewpoint problem. If a user wears this equipment, the eyes are fixed with respect to the display screen. Therefore, there is no need to measure the eye position.

5.3 Direction of gravity

We are sensitive to the orientation of an image when interpreting it as a 3D structure [5]. In particular, the direction of gravity seems important. For example, the house in Fig. 6 has a white fence composed of vertical pickets. In the recovered images in Fig. 6c, d, however, the fence looks tilted instead of vertical. This side effect occurs because the lens axis of the camera was not horizontal: the original image in Fig. 6a was taken by a camera whose axis was directed upward. If we took the image with a horizontally directed camera, the fence would be nearly vertical in the recovered image.

For example, if we rotate the image in Fig. 6c by 10 deg counterclockwise, we get the view in Fig. 9. The orientation of the scene seems more natural in the sense that the fence looks vertical. It would be preferable if we could adjust the orientation of the recovered image such that the vertical direction of the scene remains vertical in the image. This is possible if we know the orientation of the axis of the camera with which the image was taken. It might also be possible to estimate the camera-axis orientation from the image itself. This is one of our future research topics.

Fig. 9
figure 9

Picture in Fig. 6c rotated 10 deg counterclockwise

5.4 Rectangularity preference

An image comprises 2D information, whereas a scene includes depth and is thus 3D; the depth of a scene is lost in its image. Hence, from a mathematical point of view, a single image can represent various 3D scenes [6, 14]. However, we usually perceive a unique 3D scene, up to scale, when we see an image. This means that our brains implicitly assume a certain class of scenes. A typical assumption is the rectangularity assumption: it is well known that the human brain prefers rectangular structures when interpreting images [11, 12]. Because of this preference, we perceive rectangular rooms from wide-angle images even though the shapes of objects are distorted in the image.

One-point perspective projection has a remarkable feature in that the rectangular corners of the room are depicted as rectangular without any distortion. This class of image is obtained when one group of parallel lines is parallel to the lens axis and the other two groups are parallel to the image plane of the camera. The vanishing point is located at the center of the image, and the lens center is on the line passing through the vanishing point perpendicular to the image plane. However, the distance of the lens center from the image plane cannot be determined from the vanishing-point information. No matter where on this line we place the center of projection, the image is a mathematically correct projection of some rectangular room. Indeed, we perceive a rectangular room more naturally when we see such an image than when we see a two-point or three-point perspective projection.

To illustrate this visual effect, Fig. 10 shows two images of the same scene: (a) is a one-point perspective projection and (b) is a two-point perspective projection. Both were taken with a 12-mm wide-angle lens. We can see that rectangular corners look rectangular in (a), while they are distorted in (b). For example, the steps of the staircase and the gate to the next room look considerably distorted from rectangles in (b). For this reason, one-point perspective projections are widely used in advertisements for real estate properties. Thus, users must be careful.

Fig. 10
figure 10

Wide-angle images of a house entrance: a one-point perspective projection, b two-point perspective projection

6 Concluding remarks

This paper points out that the depth-exaggeration effect of images used in advertisements is an optical illusion caused by the difference between the viewpoint and the lens center. In response, we developed a method to remove the illusion and thus provide a faithful appearance of the scene. Our method requires only an image and the associated lens center, and hence can be put to practical use as long as the lens-center information is stored with the photograph or estimated from the image.

Depth-exaggerated images are widely used, but this situation is not desirable because consumers cannot obtain accurate information when deciding on a purchase. However, it may not be practical to prohibit exaggerated images by law. Indeed, this is related to the issue of “freedom of expression.” Photographers use wide-angle lenses to create special visual effects when they want to represent more of the actual space in a photograph. Hence, this problem is controversial.

However, providing faithful visual information about commercial goods is important and useful to both the advertisers and the consumers. Thus, it may not be difficult to establish a new culture in which images are accompanied with the associated lens center. Once this information is available routinely, our method can be used widely. Even if this lens-center information is not established, we can use various methods to estimate the lens center.

Remote shopping is becoming more and more popular, especially due to the COVID-19 pandemic. Hence, it is important to present faithful information about the offered goods through images in order to raise market reliability. The proposed method may contribute to this goal.