
Semantic Features from Web-Traffic Streams

Network Science and Cybersecurity

Part of the book series: Advances in Information Security ((ADIS,volume 55))


Abstract

We describe a method to convert web-traffic textual streams into a set of documents in a corpus, allowing established linguistic tools to be applied to the study of semantics, topic evolution, and token-combination signatures. A novel web-document corpus is also described, which represents semantic features from each batch for subsequent analysis. An (American-English) lexicon is used to create a canonical representation of each corpus, whereby each TermID maps consistently to its corresponding lexicon word or token. Finally, each corpus member is represented as a ‘document’ by combining the HTTP request string with the concatenation of all responses to it. This representation associates the request-string tokens with the resulting content, for consumption by document-classification and comparison algorithms.
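The document construction described above can be sketched in a few lines. This is an illustrative sketch only, not the chapter's implementation; the lexicon, function names, and sample strings are hypothetical.

```python
# Sketch (hypothetical names): a corpus "document" is the HTTP request
# string concatenated with all responses to it; in-lexicon tokens are
# then mapped to canonical TermIDs.

def make_document(request, responses):
    """A corpus member is the request string followed by all responses."""
    return request + " " + " ".join(responses)

def to_termids(document, lexicon):
    """Map each in-lexicon token to its canonical TermID; drop the rest."""
    tokens = document.lower().split()
    return [lexicon[t] for t in tokens if t in lexicon]

# Hypothetical miniature lexicon: word -> TermID
lexicon = {"index": 0, "html": 1, "welcome": 2, "page": 3}

doc = make_document("GET /index.html", ["welcome page", "index"])
print(to_termids(doc, lexicon))  # → [2, 3, 0]
```

Because the mapping is canonical, the same TermID refers to the same lexicon word across every corpus built from that lexicon, which is what allows later comparison across batches.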



Author information

Correspondence to Steve Hutchinson.

Appendices

Appendix: Sample Representations

The following excerpt illustrates typical web traffic captured by Snort, TCPdump, and other string-oriented capture tools. These tools often add line feeds after header records to enhance human readability. Binary data are rendered as ASCII characters, with ‘.’ substituted when the corresponding byte is not printable (Fig. 1).

Fig. 1

Various lexicons used in our process to represent and analyze web-stream content. Three of the lexicons are (mostly) removed using stop-word lists. These sub-lexicons (II, III, V) would distract from semantic analyses that use T:F representations

Raw Data (from TCPdump)
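The printable-ASCII rendering described above (non-printable bytes shown as ‘.’) can be sketched as follows. This is a generic illustration of the convention, not the output of any particular capture tool; the sample request bytes are hypothetical.

```python
def printable_ascii(data):
    """Render captured bytes as ASCII, substituting '.' for any byte
    outside the printable range, as string-oriented capture tools do."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)

# Hypothetical captured request; CR/LF bytes render as '.'
raw = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n"
print(printable_ascii(raw))
```

The line feeds that capture tools insert for readability survive this rendering as ‘.’ characters, which is why tokenization must tolerate runs of dots inside otherwise textual streams.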

Raw Traffic Strings After Tokenization

After Stop-Word Removal and Mapping to the Lexicon TermIDs

A validation display of token expansion from (T:F) back to lexicon word[T] per document. Documents #25 and #26 are shown:
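The expansion from (T:F) pairs back to lexicon words can be sketched as below. This is an illustrative reconstruction of the validation step, with a hypothetical miniature lexicon; the chapter's own code is not reproduced here.

```python
def expand_tf(tf_pairs, lexicon_words):
    """Validation step: expand (TermID, frequency) pairs back to
    lexicon words, repeating each word 'frequency' times."""
    out = []
    for term_id, freq in tf_pairs:
        out.extend([lexicon_words[term_id]] * freq)
    return out

# Hypothetical lexicon indexed by TermID
words = ["alpha", "beta", "gamma"]
print(expand_tf([(0, 2), (2, 1)], words))  # → ['alpha', 'alpha', 'gamma']
```

Round-tripping a document through (T:F) and back loses token order but preserves the multiset of lexicon words, which is sufficient to confirm correspondence with the original corpus document.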

WebStops

The webstops list contains many word tokens that comprise HTML markup and containers in web pages, such as tables, JavaScript functions, list structures, and style sheets. These words relate to the construction of containers common to all web pages and hence are devoid of semantic content. Even though some of these tokens collide with lexicon words, they must be removed so that their high frequency does not dominate (T:F)-sensitive semantic-analysis algorithms such as latent Dirichlet allocation.

academic, accent, accept, action, agent, align, alive, application, author, auto, background, banner, batch, before, begin, bind, blackboard, blank, block, body, bold, border, bottom, bounding, box, boxes, browse, browser, bundle, button, buttons, bytes, cache, cancel, center, char, character, characters, chars, check, class, click, clip, close, color, colorful, comma, common, compatibility, compatible, connection, console, content, continue, control, cookie, cookies, copy, . . . .
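The effect of removing webstops before computing term frequencies can be sketched as follows. The token stream and the tiny stop-word set here are hypothetical stand-ins for the full webstops list.

```python
from collections import Counter

# Hypothetical subset of the webstops list
webstops = {"table", "border", "class", "click", "color", "content"}

def term_frequencies(tokens, stopwords):
    """Drop markup stop-words before counting, so that structural
    tokens common to all web pages cannot dominate the (T:F) vector."""
    return Counter(t for t in tokens if t not in stopwords)

tokens = ["budget", "table", "border", "budget", "policy", "class"]
print(term_frequencies(tokens, webstops))  # → Counter({'budget': 2, 'policy': 1})
```

Without the filter, high-frequency markup tokens such as "table" and "class" would appear in every document and swamp the frequencies that (T:F)-sensitive algorithms like LDA rely on.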

CustomStopWords

Modern web pages contain other non-lexicon words associated with JavaScript code, variable names, and values. JavaScript markup attributes also typically contain numerous key-value pairs (KVPs). We observe that the values of KVPs most often carry the semantically interesting content. The current process tries to retain these values, as well as other named entities, while removing non-lexicon keys and variable names. The following custom stop-word list was formed after examination of a small set of web pages and was labeled by the author for subsequent use on this dataset. A more accurate and dynamic approach would process the KVPs early in tokenization by splitting on the equals character (“=”). HTML and JavaScript keywords can be formally enumerated and removed. Variable names will be more difficult to determine precisely; partial-word stemming of segments may yield satisfactory performance, since variables most often consist of concatenated words, sometimes camel-cased, for self-documentation. Entropy measures of the discovered components could also improve recognition of variable names.

bbnj, puvq, carin, pbtpid, panose, callout, errorh, btngradientopacity, imcspan, yvlq, abpay, unexpectedtype, validatedelete, pubi, headerbgcolor, logout, rssheadlinecell, colheader, classe, brea, codebase, iptg, serv, privacypolicy, sfri, offborder, jrskl, emihidden, regexpmatch, pollcometinterval, pickname, fieldcaption, reqrevision, headgrade, playlists, baccentmedium, mathfont, getmenubyname, substr, nbsp, clickimage, bord, sethttpmethod, nprmodpipe, active, . . . .
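The suggested refinement of splitting KVPs on “=” during tokenization, retaining values and discarding keys, can be sketched as below. The function name and sample tokens are hypothetical; this illustrates the suggestion rather than the implemented pipeline.

```python
def split_kvps(tokens):
    """Split key=value tokens on '=', keeping only the value side,
    which most often carries the semantically interesting content;
    tokens without '=' pass through unchanged."""
    values = []
    for tok in tokens:
        if "=" in tok:
            _, _, value = tok.partition("=")
            if value:  # drop bare keys like "name="
                values.append(value)
        else:
            values.append(tok)
    return values

print(split_kvps(["color=red", "name=budget2014", "report"]))
# → ['red', 'budget2014', 'report']
```

A production version would still need the enumerated HTML/JavaScript keyword list and variable-name heuristics described above, since values themselves can contain non-lexicon identifiers.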

Semantic Analysis After LDA (Latent Dirichlet Allocation)

Treating such vectors in their pre-converted form would preserve anonymity while allowing trending, differentiation, and anomaly analysis and comparison. The following output has been expanded back to the original words as a validation step, to show correspondence to the original corpus documents.
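One way such pre-converted vectors could be compared for trending and anomaly analysis is by cosine similarity over per-document topic proportions, without ever expanding back to words. This is a generic sketch under that assumption; the topic mixtures below are hypothetical, not output from the chapter's LDA run.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical per-document topic mixtures from LDA
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]
doc_c = [0.05, 0.10, 0.85]

print(cosine(doc_a, doc_b))  # near 1.0: similar topic profiles
print(cosine(doc_a, doc_c))  # much lower: candidate anomaly
```

Because only TermIDs and topic proportions are compared, document content need never be reconstituted, which is what preserves anonymity in the analysis.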


Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Hutchinson, S. (2014). Semantic Features from Web-Traffic Streams. In: Pino, R. (ed.) Network Science and Cybersecurity. Advances in Information Security, vol 55. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7597-2_14


  • Print ISBN: 978-1-4614-7596-5

  • Online ISBN: 978-1-4614-7597-2
