teaching machines

CS 330 Lecture 7 – Find and Replace

February 6, 2017 by . Filed under cs330, lectures, spring 2017.

Dear students,

We will focus on the final two of the three common operations for which we will use regular expressions:

  1. Asserting that text matches a pattern.
  2. Finding all matches of a pattern in a document.
  3. Replacing all matches of a pattern with some other text.

Finding all matches of a regular expression is done with String.scan. We may want the results as an array to be processed later:

matches = text.scan(/pattern/)

If there are any capturing groups in the pattern, the result will be an array of arrays. Element 0 will be [$1, $2, ...] for the first match, element 1 will hold the captures for the second match, and so on.
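Here is a minimal sketch of the two return shapes. The sample text and both patterns are made up for illustration and are not from the lecture:

```ruby
# Sample text is hypothetical, chosen only to show the two return shapes.
text = 'x=10, y=25, z=7'

# No capturing groups: scan returns a flat array of matched strings.
numbers = text.scan(/\d+/)
p numbers   # ["10", "25", "7"]

# With capturing groups: scan returns one inner array per match,
# holding that match's captures in order.
pairs = text.scan(/(\w)=(\d+)/)
p pairs     # [["x", "10"], ["y", "25"], ["z", "7"]]
p pairs[0]  # ["x", "10"]
```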

Or, we can process the array immediately by passing scan a block:

# For a pattern without capturing groups
text.scan(/pattern/) do |match|
  # process match
end

# For a pattern with capturing groups
text.scan(/pattern/) do |group1, group2|
  # process group1, group2, ... (one block parameter per capturing group)
end

The block that we give to scan is expected to be a doer, a void function. It is not expected to return anything.
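A short sketch of both block forms follows. The log text, the patterns, and the variable names are hypothetical, chosen only to make the block parameters visible:

```ruby
# Hypothetical log text, used only for illustration.
log = "GET /index.html 200\nGET /missing 404\nPOST /login 200\n"

# Without groups, the block receives each matched string.
codes = []
log.scan(/\b\d{3}$/) do |match|
  codes << match
end

# With groups, the block receives one parameter per capture.
requests = []
log.scan(/^(\w+) (\S+)/) do |verb, path|
  requests << "#{verb} -> #{path}"
end

p codes     # ["200", "404", "200"]
p requests  # ["GET -> /index.html", "GET -> /missing", "POST -> /login"]
```

Because the block's return value is ignored, any result we want to keep has to be stored by the block itself, as the two arrays are here.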

Suppose we want to replace the text that we match. For that we can use String.gsub (for a global substitution) or String.sub (for a single substitution). gsub and sub return new strings, while gsub! and sub! modify the invoking strings. The substitution can be expressed several ways:

text.gsub!(/pattern/, 'replacement text')
text.gsub!(/pattern/, 'replacement \1 with captures \2')
text.gsub!(/pattern/, "replacement \\1 with captures \\2 and \n double quotes")
text.gsub!(/pattern/) do
  # compute and return the replacement text, using $1, $2, ...
end

The block that we give to gsub is expected to be a returner, giving back the string that we want swapped in. It can contain arbitrary Ruby code that processes the matching text. This form of gsub is the most powerful because it is the most sensitive, which is how true power works.
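To make the differences concrete, here is a small sketch of sub, gsub, back-references, the block form, and a bang variant. The report string and variable names are assumptions made up for this example:

```ruby
# Hypothetical report text, used only for illustration.
report = 'widgets: 3, gadgets: 12'

# sub replaces only the first match; gsub replaces all. Both return new strings.
first_only = report.sub(/\d+/, 'N')
all        = report.gsub(/\d+/, 'N')

# Back-references in a single-quoted replacement string.
swapped = report.gsub(/(\w+): (\d+)/, '\2 \1')

# Block form: the block's return value becomes the replacement.
doubled = report.gsub(/\d+/) { |digits| (digits.to_i * 2).to_s }

# The bang variants mutate the receiver and return nil when nothing changed.
copy = report.dup
result = copy.gsub!(/zzz/, '')

p first_only  # "widgets: N, gadgets: 12"
p all         # "widgets: N, gadgets: N"
p swapped     # "3 widgets, 12 gadgets"
p doubled     # "widgets: 6, gadgets: 24"
p result      # nil
```

The nil return from gsub! is worth remembering: chaining another method call onto a bang substitution fails when the pattern does not match.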

Let’s write regex to do the following:

  1. List and number all the image URLs from img elements.
  2. List all the lines in a file that match a regex.
  3. Identify all the fields of study listed in a dictionary—the -ology, -nomy, and -nomics words.
  4. Locate all the string literals in a source file.
  5. Humanize identifiers, turning isUnderSiege to Is Under Siege.
  6. Fix missing quotation marks around attributes in HTML.
  7. Evaluate embedded mathematical expressions in a report.

Sincerely,

imgripper.rb

#!/usr/bin/env ruby

html = IO.read('onion.html')

html.scan(/<img.*?src="(.*?)"/).each_with_index do |groups, i|
  url = groups[0]
  puts "#{i}. #{url}"
  # system("wget #{url}")
end

studies.rb

#!/usr/bin/env ruby

dictionary = IO.read('/usr/share/dict/words')

# dictionary.scan(/.*(.)\1\1.*/) do
  # puts $&
# end

# exit 0

dictionary.scan(/.+(ology|nomy|nomics|graphy)$/) do
  # puts $`
  puts $&
  # puts $'
end

foo.src

this is some code
here's a string literal: "hey, foobag!" and another on the same "line"

here's one with a backslash: "I think \"presidents\" should wear bodycams."

another: "asdf34543 32423 dfggd!!!"

literals.rb

#!/usr/bin/env ruby

src = IO.read('foo.src')

src.scan(/"(\\"|[^"])*"/) do
  puts $&
end
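The order of the two alternatives in the pattern above matters. The sketch below compares both orders on a made-up sample line. It uses a non-capturing group, (?:...), so that scan returns whole matches instead of captures. That is a convenience for this example, not what literals.rb does:

```ruby
# Hypothetical sample; in a single-quoted Ruby string, \" stays as a
# backslash followed by a double quote.
src = 'say "a \"quoted\" word" now'

# Escaped quote tried first: \" is consumed as a unit, so the match
# runs to the real closing quote.
good = src.scan(/"(?:\\"|[^"])*"/)

# Alternatives swapped: [^"] swallows the backslash on its own, and the
# quote that follows then ends the literal too early.
bad = src.scan(/"(?:[^"]|\\")*"/)

puts good  # prints: "a \"quoted\" word"
puts bad   # prints two lines: "a \" and " word"
```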

humanid.rb

#!/usr/bin/env ruby

id = ARGV[0].dup

id.gsub!(/([a-z])([A-Z])/, '\1 \2')
id[0] = id[0].upcase

puts id