How can I improve the performance of a lexical analyzer in Java?
I created a lexical analyser in Java recently, but I don't think the performance is very good. The code works, but when I debugged the program, it took around 100 milliseconds for only two tokens...
Can you read my code and give me tips about performance?
Lexer.java:
package me.minkizz.minlang;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;
public class Lexer {
private StringBuilder input = new StringBuilder();
private Token token;
private String lexema;
private boolean exhausted;
private String errorMessage = "";
private static Set<Character> blankChars = new HashSet<>();
static {
blankChars.add('\r');
blankChars.add('\n');
blankChars.add((char) 8);
blankChars.add((char) 9);
blankChars.add((char) 11);
blankChars.add((char) 12);
blankChars.add((char) 32);
}
public Lexer(String filePath) {
try (Stream<String> st = Files.lines(Paths.get(filePath))) {
st.forEach(input::append);
} catch (IOException ex) {
exhausted = true;
errorMessage = "Could not read file: " + filePath;
return;
}
moveAhead();
}
public void moveAhead() {
if (exhausted) {
return;
}
if (input.length() == 0) {
exhausted = true;
return;
}
ignoreWhiteSpaces();
if (findNextToken()) {
return;
}
exhausted = true;
if (input.length() > 0) {
errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
}
}
private void ignoreWhiteSpaces() {
int charsToDelete = 0;
while (blankChars.contains(input.charAt(charsToDelete))) {
charsToDelete++;
}
if (charsToDelete > 0) {
input.delete(0, charsToDelete);
}
}
private boolean findNextToken() {
for (Token t : Token.values()) {
int end = t.endOfMatch(input.toString());
if (end != -1) {
token = t;
lexema = input.substring(0, end);
input.delete(0, end);
return true;
}
}
return false;
}
public Token currentToken() {
return token;
}
public String currentLexema() {
return lexema;
}
public boolean isSuccessful() {
return errorMessage.isEmpty();
}
public String errorMessage() {
return errorMessage;
}
public boolean isExhausted() {
return exhausted;
}
}
Token.java:
package me.minkizz.minlang;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public enum Token {
PRINT_KEYWORD("print\\b"), PRINTLN_KEYWORD("println\\b"), OPEN_PARENTHESIS("\\("), CLOSE_PARENTHESIS("\\)"),
STRING("\"[^\"]+\""), NUMBER("\\d+(\\.\\d+)?");
private final Pattern pattern;
Token(String regex) {
pattern = Pattern.compile("^" + regex);
}
int endOfMatch(String s) {
Matcher m = pattern.matcher(s);
if (m.find()) {
return m.end();
}
return -1;
}
}
Main.java:
package me.minkizz.minlang;
public class Main {
public static void main(String[] args) {
new Main();
}
public Main() {
long start = System.nanoTime();
Interpreter.execute("C:\\Users\\leodu\\OneDrive\\Bureau\\minlang.txt");
long end = System.nanoTime();
System.out
.println("Program executed in " + (end - start) + "ns (" + Math.round((end - start) / 1000000) + "ms)");
}
}
Interpreter.java:
package me.minkizz.minlang;
public class Interpreter {
private static Token previousToken;
public static void execute(String fileName) {
Lexer lexer = new Lexer(fileName);
while (!lexer.isExhausted()) {
Token token = lexer.currentToken();
String lexema = lexer.currentLexema();
if (previousToken != null) {
if (token == Token.STRING || token == Token.NUMBER) {
if (previousToken == Token.PRINT_KEYWORD) {
System.out.print(lexema);
} else if (previousToken == Token.PRINTLN_KEYWORD) {
System.out.println(lexema);
}
}
}
previousToken = token;
lexer.moveAhead();
}
}
}
Example input:
print "a"
print "b"
Here are my comments on the lexer:
1. ignoreWhiteSpaces(): instead of looping over individual chars, you can use a regex to find the first char that is not in the list.
2. Deleting from the StringBuilder is unnecessary. Matcher has find(int start).
3. Once you adopt point 2, you don't need the StringBuilder at all. You can read the whole input into a single String using Files.readAllBytes() (which probably performs better than reading one line at a time) and just keep an index pointer that moves along the input. For example, ignoreWhiteSpaces() would return the index of the first non-whitespace char at or after the index pointer.
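The index-pointer approach described above could look roughly like the sketch below. It is an illustration under some assumptions, not a drop-in replacement: the class name IndexLexer, the nested Tok enum, the fromFile helper and the use of \s* for whitespace are all made up for this example. One caveat: Matcher.find(int start) resets the matcher, so a pattern that begins with ^ would only match at index 0. The sketch therefore drops the ^ and swaps in Matcher.region() with lookingAt(), which anchors the match at the current position.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexLexer {
    // Hypothetical token set mirroring the Token enum from the question,
    // but without the leading "^" (lookingAt() anchors the match instead).
    public enum Tok {
        PRINT_KEYWORD("print\\b"), PRINTLN_KEYWORD("println\\b"),
        OPEN_PARENTHESIS("\\("), CLOSE_PARENTHESIS("\\)"),
        STRING("\"[^\"]+\""), NUMBER("\\d+(\\.\\d+)?");

        final Pattern pattern;

        Tok(String regex) {
            pattern = Pattern.compile(regex);
        }
    }

    // Assumption: any regex whitespace counts as blank.
    private static final Pattern WHITESPACE = Pattern.compile("\\s*");

    private final String input; // never modified after construction
    private int pos = 0;        // index pointer that moves along the input
    private Tok token;
    private String lexeme;
    private boolean exhausted;
    private String errorMessage = "";

    public IndexLexer(String input) {
        this.input = input;
        moveAhead();
    }

    // Reads the whole file at once instead of line by line.
    public static IndexLexer fromFile(String filePath) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filePath));
        return new IndexLexer(new String(bytes, StandardCharsets.UTF_8));
    }

    public void moveAhead() {
        if (exhausted) {
            return;
        }
        pos = skipWhitespace(pos);
        if (pos >= input.length()) {
            exhausted = true;
            return;
        }
        for (Tok t : Tok.values()) {
            Matcher m = t.pattern.matcher(input);
            m.region(pos, input.length()); // restrict matching to [pos, end)
            if (m.lookingAt()) {           // must match exactly at pos
                token = t;
                lexeme = input.substring(pos, m.end());
                pos = m.end();             // advance the pointer, delete nothing
                return;
            }
        }
        exhausted = true;
        errorMessage = "Unexpected symbol: '" + input.charAt(pos) + "'";
    }

    // Returns the index of the first non-whitespace char at or after 'from'.
    private int skipWhitespace(int from) {
        Matcher m = WHITESPACE.matcher(input);
        m.region(from, input.length());
        return m.lookingAt() ? m.end() : from;
    }

    public Tok currentToken() { return token; }
    public String currentLexeme() { return lexeme; }
    public boolean isExhausted() { return exhausted; }
    public boolean isSuccessful() { return errorMessage.isEmpty(); }
    public String errorMessage() { return errorMessage; }
}
```

The interpreter loop from the question would work unchanged against this class (with Tok in place of Token), because the input String is never copied or shortened; each token costs one substring() call for the lexeme and nothing more.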