class Linguist::Tokenizer

Generic programming language tokenizer.

Tokens are designed for use in the language Bayes classifier. The tokenizer strips out string data and comments and preserves significant language symbols.

Constants

BYTE_LIMIT

Read up to 100KB

MULTI_LINE_COMMENTS

Start state on the opening token; ignore anything until the closing token is reached.

SINGLE_LINE_COMMENTS

Start state on the token; ignore anything until the next newline.

START_MULTI_LINE_COMMENT
START_SINGLE_LINE_COMMENT
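The real values live at the top of lib/linguist/tokenizer.rb; hypothetical shapes with just a couple of entries look like this, with MULTI_LINE_COMMENTS as open/close pairs so the scanner can look the closer up with assoc:

```ruby
# Hypothetical shapes for illustration -- the shipped constants list many
# more comment styles.
BYTE_LIMIT = 100_000                    # read up to 100KB of input

MULTI_LINE_COMMENTS = [
  ['/*',   '*/'],                       # C style
  ['<!--', '-->'],                      # XML style
]

SINGLE_LINE_COMMENTS = ['//', '#']

MULTI_LINE_COMMENTS.assoc('<!--')[1]    # => "-->"
```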

Public Class Methods

tokenize(data)

Public: Extract tokens from data

data - String to tokenize

Returns Array of token Strings.

# File lib/linguist/tokenizer.rb, line 15
def self.tokenize(data)
  new.extract_tokens(data)
end
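As the one-liner shows, tokenize is just a convenience wrapper around extract_tokens. A trimmed-down, self-contained stand-in (the TinyTokenizer class below is hypothetical; it only handles quoted strings, punctuation, and plain tokens, not comments or shebangs) illustrates the same delegation pattern and the shape of the output:

```ruby
require 'strscan'

# TinyTokenizer is a hypothetical, stripped-down stand-in for
# Linguist::Tokenizer, kept here only to show the class-method-to-
# instance-method delegation and the resulting token array.
class TinyTokenizer
  def self.tokenize(data)
    new.extract_tokens(data)
  end

  def extract_tokens(data)
    s = StringScanner.new(data)
    tokens = []
    until s.eos?
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)              # drop double-quoted string data
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)              # drop single-quoted string data
      elsif token = s.scan(/;|\{|\}|\(|\)|\[|\]/)
        tokens << token                     # keep significant punctuation
      elsif token = s.scan(/[\w\.@#\/\*]+/)
        tokens << token                     # keep regular tokens
      else
        s.getch
      end
    end
    tokens
  end
end

TinyTokenizer.tokenize("printf('Hello')")
# => ["printf", "(", ")"]
```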

Public Instance Methods

extract_sgml_tokens(data)

Internal: Extract tokens from inside SGML tag.

data - SGML tag String.

Examples

extract_sgml_tokens("<a href='' class=foo>")
# => ["<a>", "href=", "class="]

Returns Array of token Strings.

# File lib/linguist/tokenizer.rb, line 159
def extract_sgml_tokens(data)
  s = StringScanner.new(data)

  tokens = []

  until s.eos?
    # Emit start token
    if token = s.scan(/<\/?[^\s>]+/)
      tokens << "#{token}>"

    # Emit attributes with trailing =
    elsif token = s.scan(/\w+=/)
      tokens << token

      # Then skip over attribute value
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)
      else
        s.skip_until(/\w+/)
      end

    # Emit lone attributes
    elsif token = s.scan(/\w+/)
      tokens << token

    # Stop at the end of the tag
    elsif s.scan(/>/)
      s.terminate

    else
      s.getch
    end
  end

  tokens
end
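The method can be exercised on its own; this stand-alone copy of the body above (wrapped in a throwaway top-level method, with `require 'strscan'` added) shows how quoted, bare, and valueless attributes each come out:

```ruby
require 'strscan'

# Stand-alone copy of extract_sgml_tokens, for experimentation only.
def extract_sgml_tokens(data)
  s = StringScanner.new(data)
  tokens = []

  until s.eos?
    if token = s.scan(/<\/?[^\s>]+/)
      tokens << "#{token}>"               # emit start token
    elsif token = s.scan(/\w+=/)
      tokens << token                     # emit attribute name with trailing =
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)            # skip double-quoted value
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)            # skip single-quoted value
      else
        s.skip_until(/\w+/)               # skip bare value
      end
    elsif token = s.scan(/\w+/)
      tokens << token                     # emit lone attribute
    elsif s.scan(/>/)
      s.terminate                         # stop at the end of the tag
    else
      s.getch
    end
  end

  tokens
end

extract_sgml_tokens("<div id='wrapper' class=container hidden>")
# => ["<div>", "id=", "class=", "hidden"]
```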

extract_shebang(data)

Internal: Extract normalized shebang command token.

Examples

extract_shebang("#!/usr/bin/ruby")
# => "ruby"

extract_shebang("#!/usr/bin/env node")
# => "node"

Returns a String token, or nil if it couldn't be parsed.

# File lib/linguist/tokenizer.rb, line 133
def extract_shebang(data)
  s = StringScanner.new(data)

  if path = s.scan(/^#!\s*\S+/)
    script = path.split('/').last
    if script == 'env'
      s.scan(/\s+/)
      script = s.scan(/\S+/)
    end
    script = script[/[^\d]+/, 0] if script
    return script
  end

  nil
end
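The method is short enough to try stand-alone; the copy below only adds `require 'strscan'`. Note how the trailing `[/[^\d]+/, 0]` lookup normalizes the interpreter name by cutting it at the first digit:

```ruby
require 'strscan'

# Stand-alone copy of extract_shebang, for experimentation only.
def extract_shebang(data)
  s = StringScanner.new(data)

  if path = s.scan(/^#!\s*\S+/)
    script = path.split('/').last
    if script == 'env'
      s.scan(/\s+/)                 # "#!/usr/bin/env foo" -> use "foo"
      script = s.scan(/\S+/)
    end
    script = script[/[^\d]+/, 0] if script   # "python2.7" -> "python"
    return script
  end

  nil
end

extract_shebang("#!/usr/local/bin/python2.7")  # => "python"
extract_shebang("#!/usr/bin/env node")         # => "node"
```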

extract_tokens(data)

Internal: Extract generic tokens from data.

data - String to scan.

Examples

extract_tokens("printf('Hello')")
# => ['printf', '(', ')']

Returns Array of token Strings.

# File lib/linguist/tokenizer.rb, line 57
def extract_tokens(data)
  s = StringScanner.new(data)

  tokens = []
  until s.eos?
    break if s.pos >= BYTE_LIMIT

    if token = s.scan(/^#!.+$/)
      if name = extract_shebang(token)
        tokens << "SHEBANG#!#{name}"
      end

    # Single line comment
    elsif s.beginning_of_line? && token = s.scan(START_SINGLE_LINE_COMMENT)
      # tokens << token.strip
      s.skip_until(/\n|\Z/)

    # Multiline comments
    elsif token = s.scan(START_MULTI_LINE_COMMENT)
      # tokens << token
      close_token = MULTI_LINE_COMMENTS.assoc(token)[1]
      s.skip_until(Regexp.compile(Regexp.escape(close_token)))
      # tokens << close_token

    # Skip single or double quoted strings
    elsif s.scan(/"/)
      if s.peek(1) == "\""
        s.getch
      else
        s.skip_until(/[^\\]"/)
      end
    elsif s.scan(/'/)
      if s.peek(1) == "'"
        s.getch
      else
        s.skip_until(/[^\\]'/)
      end

    # Skip number literals
    elsif s.scan(/(0x)?\d(\d|\.)*/)

    # SGML style brackets
    elsif token = s.scan(/<[^\s<>][^<>]*>/)
      extract_sgml_tokens(token).each { |t| tokens << t }

    # Common programming punctuation
    elsif token = s.scan(/;|\{|\}|\(|\)|\[|\]/)
      tokens << token

    # Regular token
    elsif token = s.scan(/[\w\.@#\/\*]+/)
      tokens << token

    # Common operators
    elsif token = s.scan(/<<?|\+|\-|\*|\/|%|&&?|\|\|?/)
      tokens << token

    else
      s.getch
    end
  end

  tokens
end
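The comment-handling branches depend on the constants at the top of the class. A minimal sketch with hypothetical two-entry constant values shows how the multi-line branch uses assoc to find the matching closing token and then skips everything up to it:

```ruby
require 'strscan'

# Hypothetical subset of the real constants -- the shipped lists cover
# many more comment styles.
MULTI_LINE_COMMENTS = [['/*', '*/'], ['<!--', '-->']]
START_MULTI_LINE_COMMENT = Regexp.union(MULTI_LINE_COMMENTS.map { |c| c[0] })

s = StringScanner.new("x = 1 /* setup */ y")
out = ""
until s.eos?
  if token = s.scan(START_MULTI_LINE_COMMENT)
    close_token = MULTI_LINE_COMMENTS.assoc(token)[1]         # "*/" for "/*"
    s.skip_until(Regexp.compile(Regexp.escape(close_token)))  # drop comment body
  else
    out << s.getch                                            # keep everything else
  end
end
out  # => "x = 1  y"
```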